ChemiQuery – Natural Language to SQL
A production-grade natural language to SQL system designed for chemical inventory management at Iowa State University's Environmental Health & Safety department. ChemiQuery bridges the gap between non-technical lab personnel and complex SQL Server databases by translating conversational questions into schema-aware SQL queries. The system features a fine-tuned T5-small model (~60M parameters) trained on 2,500+ examples combining public SQL datasets with 100+ curated domain-specific QA pairs. A LangGraph state machine orchestrates the inference pipeline with intelligent fallback to Llama 3.2-3b when primary generation fails, ensuring robust query execution across a 7-table relational schema tracking chemicals, hazards, locations, and inventory.
Key metrics:
- ROUGE Improvement: 19x (0.05 → 0.996)
- Model Size: 60M params (T5-small)
- Training Data: 2,500+ examples
- Query Accuracy: ~95%
System Components
6 interconnected services
The architecture consists of these cooperating services, each with distinct responsibilities:
Fine-Tuned T5 Generator
Primary SQL generation model: T5-small fine-tuned for 2 epochs on combined public and domain-specific datasets. Uses bfloat16 precision with schema-prefixed prompts for context-aware generation.
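A minimal sketch of loading and calling the checkpoint; the path matches the training output directory shown later, while the generation settings and helper name are assumptions:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("./finetuned_model_2_epoch")
model = T5ForConditionalGeneration.from_pretrained(
    "./finetuned_model_2_epoch", torch_dtype=torch.bfloat16
)

def generate_sql(schema: str, question: str) -> str:
    # Schema-prefixed prompt, mirroring the fine-tuning template
    prompt = f"Tables:\n{schema}\n\nQuestion:\n{question}\n\nAnswer:\n"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)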
LangGraph Orchestrator
State machine coordinating the inference pipeline with 4 nodes: SQL generation, query execution, fallback handling, and answer generation. Manages state transitions and error recovery.
Fallback LLM
Llama 3.2-3b-instruct serving as the recovery mechanism when the fine-tuned model produces non-executable SQL. Receives error context and regenerates corrected queries.
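A sketch of how this node might fold the error context into a retry prompt; call_llama is a hypothetical stand-in for whatever client serves Llama 3.2-3b-instruct:

def call_llama(prompt: str) -> str:
    # Hypothetical stand-in for the actual Llama 3.2-3b-instruct client
    raise NotImplementedError

def retry_with_llama(state: dict) -> dict:
    # The failed query and database error are included so the model
    # can regenerate a corrected query
    prompt = (
        f"Question: {state['question']}\n"
        f"Failed SQL: {state['query']}\n"
        f"Database error: {state['result']}\n"
        "Rewrite the SQL so it executes correctly. Return only the query."
    )
    state["query"] = call_llama(prompt)
    return state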
SQL Server Database
7-table relational schema for chemical inventory: Chem_Chemical, Chem_Hazard (30+ flags), Chem_Substance, Chem_Item, Lard_RoomAuth, Room_Inventory, Building. Stores hazard classifications, quantities, and location hierarchy.
Training Pipeline
End-to-end fine-tuning workflow: multi-source dataset aggregation, tokenization with schema concatenation, Trainer API with AdamW optimizer, ROUGE evaluation, and model checkpointing.
Answer Generator
Converts raw SQL query results into natural language responses. Formats tabular data, handles empty results gracefully, and provides context-aware explanations.
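A sketch of this node; the real formatting logic isn't shown here, so the wording and row handling are assumptions:

def format_response(state: dict) -> dict:
    rows = state["result"]  # a list of row dicts is assumed here
    if not rows:
        # Handle empty results gracefully rather than returning a blank answer
        state["answer"] = "No records matched your question."
        return state
    lines = [f"{i}. {row}" for i, row in enumerate(rows, start=1)]
    state["answer"] = f"Found {len(rows)} matching records:\n" + "\n".join(lines)
    return state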
Data Flow
10 steps from input to output
How data flows through the system from client request to final response:
1. User submits a natural language question: 'List all toxic chemicals that are also corrosive'
2. The orchestrator loads the full 7-table schema context from context.txt
3. It constructs the schema-prefixed prompt: 'Tables:\n{schema}\n\nQuestion:\n{question}\n\nAnswer:\n' (sketched after this list)
4. The fine-tuned T5 model returns a generated SQL query with JOIN and WHERE clauses
5. The orchestrator executes the generated SQL via a SQLAlchemy/pyodbc connection
6. The database returns query results or an execution error
7. On error, the fallback node sends the original question plus the error message to Llama 3.2-3b for SQL regeneration
8. Execution is retried with the corrected SQL query
9. Successful query results are passed to the answer generator for natural language formatting
10. The user receives the formatted response: 'Found 23 chemicals that are both toxic and corrosive...'
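A minimal sketch of steps 2-3, assuming the schema lives in context.txt as described; the helper name build_prompt is illustrative:

def build_prompt(question: str, schema_path: str = "context.txt") -> str:
    # Load the full 7-table schema and prefix it to the question,
    # mirroring the prompt template used during fine-tuning
    with open(schema_path, encoding="utf-8") as f:
        schema = f.read()
    return f"Tables:\n{schema}\n\nQuestion:\n{question}\n\nAnswer:\n"

prompt = build_prompt("List all toxic chemicals that are also corrosive")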
Code Examples
4 ready-to-use snippets
Get started quickly with these practical code examples:
Natural Language Query
End-to-end query from natural language to results
# User asks a question in plain English
question = "List all chemicals that are both carcinogenic and flammable"
# LangGraph state machine processes the request
state = {
    "question": question,
    "query": "",   # Will be filled by T5
    "result": "",  # Will be filled by SQL execution
    "answer": "",  # Final natural language response
}
# Execute the workflow
result = graph.invoke(state)
print(result["answer"])
# Output: "Found 12 chemicals matching your criteria:
# 1. Benzene (CAS: 71-43-2) - Location: Chemistry Building, Room 301
# 2. Formaldehyde (CAS: 50-00-0) - Location: Biology Lab, Room 105
# ..."Schema-Aware Prompt Template
Prompt structure for T5 fine-tuning with full schema context
Tables:
CREATE TABLE Chem_Chemical (
    ChemicalId INT PRIMARY KEY,
    Name NVARCHAR(255),
    CASNumber NVARCHAR(50),
    is_reviewed BIT,
    is_active BIT
);
CREATE TABLE Chem_Hazard (
    HazardId INT PRIMARY KEY,
    ChemicalId INT FOREIGN KEY REFERENCES Chem_Chemical(ChemicalId),
    is_toxic BIT,
    is_carcinogen BIT,
    is_flammable BIT,
    is_corrosive BIT,
    NFPA_Health INT,
    NFPA_Flammability INT
    -- 30+ additional hazard flags...
);
Question:
Find chemicals with NFPA health rating above 3 that haven't been reviewed
Answer:
SELECT c.Name, c.CASNumber, h.NFPA_Health
FROM Chem_Chemical c
JOIN Chem_Hazard h ON c.ChemicalId = h.ChemicalId
WHERE h.NFPA_Health > 3 AND c.is_reviewed = 0

LangGraph State Machine
State machine definition with fallback logic
from langgraph.graph import StateGraph, END
from typing import TypedDict
class QueryState(TypedDict):
    question: str
    query: str
    result: str
    answer: str
# Define the workflow graph
workflow = StateGraph(QueryState)
# Add nodes
workflow.add_node("run_finetuned_llm", generate_sql_with_t5)
workflow.add_node("execute_query", run_sql_on_database)
workflow.add_node("fall_back", retry_with_llama)
workflow.add_node("generate_answer", format_response)
# Define edges with conditional routing
workflow.set_entry_point("run_finetuned_llm")
workflow.add_edge("run_finetuned_llm", "execute_query")
workflow.add_conditional_edges(
    "execute_query",
    lambda x: "fall_back" if "error" in x["result"] else "generate_answer",
)
workflow.add_edge("fall_back", "execute_query")
workflow.add_edge("generate_answer", END)
graph = workflow.compile()

Training Configuration
Fine-tuning setup with Hugging Face Trainer
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./finetuned_model_2_epoch",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-3,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=50,
    save_strategy="steps",  # must match evaluation_strategy when load_best_model_at_end=True
    save_steps=500,
    load_best_model_at_end=True,
    bf16=True,  # bfloat16 for memory efficiency
)
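# A sketch of the ROUGE callback passed to compute_metrics below; the exact
# implementation is not shown here, so the decoding details are assumptions
# (`tokenizer` is the T5 tokenizer used during preprocessing).
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_rouge_metrics(eval_pred):
    preds, labels = eval_pred
    if isinstance(preds, tuple):
        preds = preds[0]
    preds = np.argmax(preds, axis=-1)  # plain Trainer yields logits, not token ids
    # Replace the -100 padding used for the loss before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)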
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_rouge_metrics,
)
# Train and evaluate
trainer.train()
# Results: ROUGE-1: 0.996, ROUGE-L: 0.996

Infrastructure
Production-ready deployment stack
Battle-tested infrastructure components for reliable operations:
Model Training Pipeline
End-to-end fine-tuning infrastructure: multi-source dataset aggregation from Hugging Face Hub, tokenization with schema concatenation, Trainer API with AdamW optimizer, and ROUGE-based evaluation. Supports distributed training and model checkpointing.
LangGraph Orchestration
State machine framework managing the inference pipeline. Handles state transitions between SQL generation, execution, fallback, and answer formatting. Supports conditional routing based on query success/failure.
Database Integration
SQL Server connectivity via pyodbc with ODBC Driver 17. SQLAlchemy ORM for query isolation and connection pooling. Supports the full 7-table schema with 30+ hazard classification columns.
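A minimal connection sketch matching this description; the host, database name, credentials, and helper name are placeholders:

from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:password@SERVER_HOST/ChemInventory"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

def execute_sql(query: str) -> list[dict]:
    # Read-only execution; each row is returned as a plain dict
    with engine.connect() as conn:
        return [dict(row._mapping) for row in conn.execute(text(query))]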
Multi-Source Dataset Pipeline
Aggregates training data from 3 public Hugging Face datasets (b-mc2/sql-create-context, Clinton/Text-to-sql-v1, knowrohit07/know_sql) plus 100+ curated domain-specific QA pairs. 80/10/10 train/val/test split.
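A sketch of the aggregation and split using the datasets library; to_shared_format is a hypothetical helper, since each public source uses different column names:

from datasets import load_dataset, concatenate_datasets

def to_shared_format(ds):
    # Hypothetical helper: rename/select columns so every source exposes the
    # same (question, context, answer) fields, as concatenate_datasets requires
    return ds

combined = concatenate_datasets([
    to_shared_format(load_dataset("b-mc2/sql-create-context", split="train")),
    to_shared_format(load_dataset("Clinton/Text-to-sql-v1", split="train")),
    to_shared_format(load_dataset("knowrohit07/know_sql", split="train")),
])
# The 100+ curated chemical-inventory QA pairs would be appended here

# 80/10/10 train/validation/test split
split = combined.train_test_split(test_size=0.2, seed=42)
val_test = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], val_test["train"], val_test["test"]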
Key Features
- Fine-tuned T5-small (60M params) with bfloat16 precision for efficient GPU inference
- 19x ROUGE score improvement: 0.05 (zero-shot baseline) → 0.996 (fine-tuned) on domain queries
- Multi-source training data: 3 public SQL datasets + 100+ curated chemical inventory QA pairs
- LangGraph state machine with 4 nodes: generate → execute → fallback → answer
- Intelligent fallback to Llama 3.2-3b-instruct when primary SQL generation fails execution
- Schema-aware prompting with full 7-table context (Chem_Chemical, Chem_Hazard, Chem_Substance, Chem_Item, Lard_RoomAuth, Room_Inventory, Building)
- 30+ hazard flag support: carcinogen, corrosive, flammable, radioactive, NFPA ratings, and more
- ~5 second end-to-end latency (generation + SQL execution + natural language response)
- Complex query support: JOINs, aggregations, window functions, multi-condition filters
Challenges Solved
- Creating high-quality domain-specific training data for chemical safety terminology and complex hazard relationships
- Balancing schema context size (7 tables, 30+ hazard columns) against tokenization limits (512 tokens)
- Handling ambiguous natural language that could map to multiple valid SQL interpretations
- Ensuring generated SQL is safe: read-only queries, no destructive operations, injection-resistant (a minimal guard is sketched after this list)
- Designing fallback logic that recovers gracefully from execution errors without user intervention
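One way the read-only constraint could be enforced; a minimal sketch, not the project's actual checks:

import re

# Destructive keywords that should never appear in a generated query
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|EXEC|MERGE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    statement = sql.strip().rstrip(";")
    # Single statement that starts with SELECT (or a CTE) and contains
    # no destructive keywords
    return (
        ";" not in statement
        and statement.upper().startswith(("SELECT", "WITH"))
        and not FORBIDDEN.search(statement)
    )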
Outcomes & Impact
- Enabled non-technical lab staff to query complex data: 'List reviewed chemicals with multiple health risk indicators'
- ~95% query accuracy on held-out chemical inventory questions with syntactically correct SQL
- Reduced bottleneck for compliance officers accessing hazard and location data for safety audits
- Demonstrated effective domain adaptation: public SQL datasets + focused domain data = specialized model
- Production-ready architecture with automatic retry, error handling, and natural language response generation