Create a SQL REPL for JSON Files in Python

When working with JSON data, it’s common to need quick exploratory queries without writing a full application. By combining Pandas for data handling, DuckDB for SQL querying, and a few standard-library modules to make our lives a little easier, we can build an effective SQL REPL (Read–Eval–Print Loop) for JSON files in five short steps.

Step 1: Command-Line Argument Parsing

We start by using Python’s argparse module to accept the JSON file path as a command-line parameter. This makes our script flexible, allowing us to query different JSON files without modifying the code.

import argparse

# Set up the argument parser with a description and a required parameter for the JSON file path
parser = argparse.ArgumentParser(description='Load JSON file and start an SQL REPL for querying it using DuckDB.')
parser.add_argument('json_file', help='Path to the JSON file')
args = parser.parse_args()
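To see how the parser behaves without invoking the script from a shell, you can pass an argument list directly to parse_args. The filename sample.json below is just a placeholder for illustration:

```python
import argparse

parser = argparse.ArgumentParser(description='Load JSON file and start an SQL REPL for querying it using DuckDB.')
parser.add_argument('json_file', help='Path to the JSON file')

# Simulate a command line; the real script reads sys.argv instead
args = parser.parse_args(['sample.json'])
print(args.json_file)  # sample.json
```

Because json_file is a positional argument, running the script with no arguments prints a usage message and exits, which is exactly the behavior we want for a required parameter.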

Step 2: Loading and Preparing the JSON Data

We load the JSON file using Pandas. Since our JSON structure uses keys as row identifiers, we specify orient='index' so that each key becomes a row. We then reset the index and rename the column to word for clarity. This transformation makes our data easier to query using SQL.

import pandas as pd

# Load the JSON file, treating keys as rows and then reset the index so keys become a column
df = pd.read_json(args.json_file, orient='index').reset_index()
df.rename(columns={'index': 'word'}, inplace=True)
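To illustrate what this transformation does, here is a minimal sketch using a made-up two-entry JSON document (the field names total_count and pos simply mirror the sample output shown later in the article):

```python
import io
import pandas as pd

# A hypothetical JSON document whose top-level keys identify rows,
# matching the structure that orient='index' expects
sample = io.StringIO(
    '{"apple": {"total_count": 500, "pos": "NOUN"},'
    ' "run":   {"total_count": 350, "pos": "VERB"}}'
)

# Keys become the index, then reset_index() turns them into a column
df = pd.read_json(sample, orient='index').reset_index()
df.rename(columns={'index': 'word'}, inplace=True)

print(list(df.columns))  # ['word', 'total_count', 'pos']
```

Each top-level key ("apple", "run") ends up as a value in the word column, with the nested fields spread across the remaining columns.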

Step 3: Setting Up DuckDB for SQL Querying

DuckDB is a lightweight, in-process SQL engine that works seamlessly with Pandas DataFrames. We create an in-memory DuckDB connection and register our DataFrame as a table named df. This registration allows us to write standard SQL queries against the data.

import duckdb

# Connect to DuckDB (in-memory) and register the DataFrame as a table named 'df'
con = duckdb.connect()
con.register('df', df)

Step 4: Enhancing the CLI with Command History

To mimic a typical command-line shell (like Bash), we incorporate Python's readline module, which lets users navigate their command history using the up and down arrow keys. (readline ships with CPython on Linux and macOS; on Windows, the third-party pyreadline3 package provides a compatible substitute.) Additionally, we configure persistent history by saving commands to a file (in this case, .sql_repl_history).

import readline

# Set up a history file for persistent command history
history_file = ".sql_repl_history"
try:
    readline.read_history_file(history_file)
except FileNotFoundError:
    pass  # First run; no history file exists yet
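The in-memory history API that the REPL relies on is easy to exercise in isolation. This sketch adds two entries by hand and reads them back:

```python
import readline

readline.clear_history()
readline.add_history("SELECT 1")
readline.add_history("SELECT 2")

# History entries are 1-indexed
print(readline.get_current_history_length())  # 2
print(readline.get_history_item(1))           # SELECT 1
```

Anything added via add_history becomes reachable with the arrow keys at the next input() prompt, and write_history_file persists exactly this in-memory list to disk.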

Step 5: Building the REPL Loop

The REPL loop is the core of our CLI. It continuously prompts the user for SQL commands, executes them using DuckDB, and prints the results. We also add each valid query to the command history and handle exceptions gracefully.

# REPL loop for SQL queries using DuckDB with history support
while True:
    try:
        query = input("SQL> ")
    except (EOFError, KeyboardInterrupt):
        break  # Ctrl+D or Ctrl+C also exits cleanly
    if not query.strip():
        continue  # Ignore empty input
    if query.strip().lower() in ('exit', 'quit'):
        break
    readline.add_history(query)  # Save the query to history

    try:
        result = con.execute(query).fetchdf()
        print(result)
    except Exception as e:
        print("Error:", e)

# Save command history upon exiting
readline.write_history_file(history_file)

Full Script

Below is the complete script that integrates all the components discussed above. You can run this locally, pass a JSON file as a parameter, and start exploring your data using SQL commands.

import pandas as pd
import duckdb
import argparse
import readline  # Enables command history and arrow-key navigation

# Set up the argument parser with a description and JSON file parameter
parser = argparse.ArgumentParser(description='Load JSON file and start an SQL REPL for querying it using DuckDB.')
parser.add_argument('json_file', help='Path to the JSON file')
args = parser.parse_args()

# Load the JSON file with orient='index' to treat keys as rows, then reset index so keys become a column
df = pd.read_json(args.json_file, orient='index').reset_index()
df.rename(columns={'index': 'word'}, inplace=True)

# Connect to DuckDB (in-memory) and register the DataFrame as a table named 'df'
con = duckdb.connect()
con.register('df', df)

# Set up a history file for command history using readline
history_file = ".sql_repl_history"
try:
    readline.read_history_file(history_file)
except FileNotFoundError:
    pass  # First run; no history file exists yet

# REPL loop for SQL queries using DuckDB with history support
while True:
    try:
        query = input("SQL> ")
    except (EOFError, KeyboardInterrupt):
        break  # Ctrl+D or Ctrl+C also exits cleanly
    if not query.strip():
        continue  # Ignore empty input
    if query.strip().lower() in ('exit', 'quit'):
        break
    readline.add_history(query)  # Save the query to history

    try:
        result = con.execute(query).fetchdf()
        print(result)
    except Exception as e:
        print("Error:", e)

# Save command history on exit
readline.write_history_file(history_file)

Save the script as json_repl.py. You can then use it by calling:

python json_repl.py sample.json

You should then see something like this, after which you can interact with your JSON file:

SQL> show tables
  name
0   df
SQL> select * from df
        word  total_count                           original_forms   pos         en
0      apple          500            {'apple': 400, 'apples': 100}  NOUN      apple
1        run          350  {'run': 200, 'running': 100, 'ran': 50}  VERB        run
2    quickly          275                         {'quickly': 275}   ADV    quickly
3  beautiful          180                       {'beautiful': 180}   ADJ  beautiful
4       they          600                            {'they': 600}  PRON       they
SQL>

Conclusion

This tutorial demonstrates how to build a CLI for querying JSON data using SQL syntax. By leveraging Pandas, DuckDB, argparse, and readline, we created a robust and interactive tool that simplifies data exploration. This approach is ideal for professionals who require quick insights from JSON files without the overhead of building a full-fledged application.