Step 1: Command-Line Argument Parsing
We start by using Python's argparse module to accept the JSON file path as a command-line parameter. This makes our script flexible, allowing us to query different JSON files without modifying the code.
import argparse
# Set up the argument parser with a description and a required parameter for the JSON file path
parser = argparse.ArgumentParser(description='Load JSON file and start an SQL REPL for querying it using DuckDB.')
parser.add_argument('json_file', help='Path to the JSON file')
args = parser.parse_args()
Step 2: Loading and Preparing the JSON Data
We load the JSON file using Pandas. Since our JSON structure uses keys as row identifiers, we specify orient='index' so that each key becomes a row. We then reset the index and rename the resulting column to word for clarity. This transformation makes our data easier to query using SQL.
import pandas as pd
# Load the JSON file, treating keys as rows and then reset the index so keys become a column
df = pd.read_json(args.json_file, orient='index').reset_index()
df.rename(columns={'index': 'word'}, inplace=True)
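To make the expected input shape concrete, here is a minimal sketch of the key-per-row JSON layout this step assumes, loaded the same way. The field names (total_count, pos, en) simply mirror the sample output shown later in this tutorial; any flat mapping of keys to records works identically.

```python
import json
import tempfile

import pandas as pd

# Hypothetical sample data in the assumed key-per-row shape
sample = {
    "apple": {"total_count": 500, "pos": "NOUN", "en": "apple"},
    "run": {"total_count": 350, "pos": "VERB", "en": "run"},
}

# Write it to a temporary file so we can load it exactly as the script does
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    path = f.name

# orient='index' turns each top-level key into one row
df = pd.read_json(path, orient="index").reset_index()
df.rename(columns={"index": "word"}, inplace=True)
print(df)
```

After this transformation the former JSON keys live in the word column, with one column per field of the nested records.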
Step 3: Setting Up DuckDB for SQL Querying
DuckDB is a lightweight, in-process SQL engine that works seamlessly with Pandas DataFrames. We create an in-memory DuckDB connection and register our DataFrame as a table named df. This registration allows us to write standard SQL queries against the data.
import duckdb
# Connect to DuckDB (in-memory) and register the DataFrame as a table named 'df'
con = duckdb.connect()
con.register('df', df)
Step 4: Enhancing the CLI with Command History
To mimic a typical command-line shell (like Bash), we incorporate the readline module. It lets users navigate their command history using the up and down arrow keys. Additionally, we configure persistent history by saving commands to a file (in this case, .sql_repl_history).
import readline
# Set up a history file for persistent command history
history_file = ".sql_repl_history"
try:
    readline.read_history_file(history_file)
except FileNotFoundError:
    pass  # First run; no history file exists yet
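One optional refinement, not strictly needed for the tutorial: readline can cap how many entries it writes back to the history file, so the file does not grow without bound across sessions. The limit of 1000 below is an arbitrary choice.

```python
import readline

# Cap the number of entries written to the history file on exit.
# 1000 is an arbitrary limit; -1 (the default) means unlimited.
readline.set_history_length(1000)
```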
Step 5: Building the REPL Loop
The REPL loop is the core of our CLI. It continuously prompts the user for SQL commands, executes them using DuckDB, and prints the results. We also add each valid query to the command history and handle exceptions gracefully.
# REPL loop for SQL queries using DuckDB with history support
while True:
    try:
        query = input("SQL> ").strip()
        if query.lower() in ('exit', 'quit'):
            break
        if not query:  # Skip empty input
            continue
        readline.add_history(query)  # Save only non-empty queries to history
        result = con.execute(query).fetchdf()
        print(result)
    except (EOFError, KeyboardInterrupt):
        print()
        break  # Ctrl-D or Ctrl-C also exits cleanly
    except Exception as e:
        print("Error:", e)
# Save command history upon exiting
readline.write_history_file(history_file)
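As an alternative to writing the history file manually after the loop, Python's standard atexit module can register the save so it also runs if the script terminates some other way. This is a sketch of that variant, not part of the main script:

```python
import atexit
import readline

history_file = ".sql_repl_history"

# Register the history save to run automatically at interpreter exit,
# instead of calling write_history_file explicitly after the loop.
atexit.register(readline.write_history_file, history_file)
```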
Full Script
Below is the complete script that integrates all the components discussed above. You can run this locally, pass a JSON file as a parameter, and start exploring your data using SQL commands.
import pandas as pd
import duckdb
import argparse
import readline # Enables command history and arrow-key navigation
# Set up the argument parser with a description and JSON file parameter
parser = argparse.ArgumentParser(description='Load JSON file and start an SQL REPL for querying it using DuckDB.')
parser.add_argument('json_file', help='Path to the JSON file')
args = parser.parse_args()
# Load the JSON file with orient='index' to treat keys as rows, then reset index so keys become a column
df = pd.read_json(args.json_file, orient='index').reset_index()
df.rename(columns={'index': 'word'}, inplace=True)
# Connect to DuckDB (in-memory) and register the DataFrame as a table named 'df'
con = duckdb.connect()
con.register('df', df)
# Set up a history file for command history using readline
history_file = ".sql_repl_history"
try:
    readline.read_history_file(history_file)
except FileNotFoundError:
    pass  # First run; no history file exists yet
# REPL loop for SQL queries using DuckDB with history support
while True:
    try:
        query = input("SQL> ").strip()
        if query.lower() in ('exit', 'quit'):
            break
        if not query:  # Skip empty input
            continue
        readline.add_history(query)  # Only add non-empty queries to history
        result = con.execute(query).fetchdf()
        print(result)
    except (EOFError, KeyboardInterrupt):
        print()
        break  # Ctrl-D or Ctrl-C also exits cleanly
    except Exception as e:
        print("Error:", e)
# Save command history on exit
readline.write_history_file(history_file)
Save the script as json_repl.py. You can then use it by calling:
python json_repl.py sample.json
You should then see something like this, after which you can interact with your JSON file:
SQL> show tables
name
0 df
SQL> select * from df
word total_count original_forms pos en
0 apple 500 {'apple': 400, 'apples': 100} NOUN apple
1 run 350 {'run': 200, 'running': 100, 'ran': 50} VERB run
2 quickly 275 {'quickly': 275} ADV quickly
3 beautiful 180 {'beautiful': 180} ADJ beautiful
4 they 600 {'they': 600} PRON they
SQL>
Conclusion
This tutorial demonstrates how to build a CLI for querying JSON data using SQL syntax. By leveraging Pandas, DuckDB, argparse, and readline, we created a robust and interactive tool that simplifies data exploration. This approach is ideal for professionals who require quick insights from JSON files without the overhead of building a full-fledged application.