Machine Learning · Aug 2024 – Dec 2024 · Featured

ChemiQuery – Natural Language to SQL

A production-grade natural language to SQL system designed for chemical inventory management at Iowa State University's Environmental Health & Safety department. ChemiQuery bridges the gap between non-technical lab personnel and complex SQL Server databases by translating conversational questions into schema-aware SQL queries. The system features a fine-tuned T5-small model (~60M parameters) trained on 2,500+ examples combining public SQL datasets with 100+ curated domain-specific QA pairs. A LangGraph state machine orchestrates the inference pipeline with intelligent fallback to Llama 3.2-3b when primary generation fails, ensuring robust query execution across a 7-table relational schema tracking chemicals, hazards, locations, and inventory.

  • ROUGE Improvement: 19x (0.05 → 0.996)
  • Model Size: 60M params (T5-small)
  • Training Data: 2,500+ examples
  • Query Accuracy: ~95%

Tech Stack

12 technologies used

Python · PyTorch · Transformers · T5-small · LangChain · LangGraph · SQL Server · Hugging Face Datasets · ROUGE Evaluation · bfloat16 · SQLAlchemy · pyodbc

System Components

6 interconnected services

The architecture consists of these cooperating services, each with distinct responsibilities:

Fine-Tuned T5 Generator

backend

Primary SQL generation model: T5-small fine-tuned for 2 epochs on combined public and domain-specific datasets. Uses bfloat16 precision with schema-prefixed prompts for context-aware generation.

Protocol: Local Inference
T5-small · Transformers · PyTorch · bfloat16
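
A minimal loading-and-inference sketch for this component, assuming the fine-tuned checkpoint lives in ./finetuned_model_2_epoch (the output directory used in the training configuration below); the project's actual inference code is not shown on this page.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the fine-tuned checkpoint in bfloat16 (path assumed from the training config)
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("./finetuned_model_2_epoch")
model = T5ForConditionalGeneration.from_pretrained(
    "./finetuned_model_2_epoch", torch_dtype=torch.bfloat16
).to(device).eval()

def generate_sql(schema: str, question: str) -> str:
    # Schema-prefixed prompt, mirroring the template used during fine-tuning
    prompt = f"Tables:\n{schema}\n\nQuestion:\n{question}\n\nAnswer:\n"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)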

LangGraph Orchestrator

backend

State machine coordinating the inference pipeline with 4 nodes: SQL generation, query execution, fallback handling, and answer generation. Manages state transitions and error recovery.

Protocol: State Machine
LangGraph · LangChain · TypedDict

Fallback LLM

backend

Llama 3.2-3b-instruct serving as a recovery mechanism when the fine-tuned model produces non-executable SQL. Receives the error context and regenerates a corrected query.

Protocol: HTTP/REST
Llama 3.2-3b · OpenAI-compatible API · LM Studio
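
A hedged sketch of the fallback call, assuming LM Studio serves its OpenAI-compatible API at the default http://localhost:1234/v1; the endpoint, model identifier, and prompt wording are assumptions, not taken from the project.

from openai import OpenAI

# LM Studio's OpenAI-compatible server (default local port assumed)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def regenerate_sql(question: str, failed_sql: str, error: str, schema: str) -> str:
    # Hand the fallback model the schema, the failed query, and the execution error
    response = client.chat.completions.create(
        model="llama-3.2-3b-instruct",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You write read-only SQL Server queries. Return only the SQL."},
            {"role": "user",
             "content": (f"Schema:\n{schema}\n\nQuestion: {question}\n\n"
                         f"This query failed:\n{failed_sql}\nError: {error}\n\n"
                         "Return a corrected query.")},
        ],
    )
    return response.choices[0].message.content.strip()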

SQL Server Database

database · port 1433

7-table relational schema for chemical inventory: Chem_Chemical, Chem_Hazard (30+ flags), Chem_Substance, Chem_Item, Lard_RoomAuth, Room_Inventory, Building. Stores hazard classifications, quantities, and location hierarchy.

Protocol: TDS
SQL Server · pyodbc · ODBC Driver 17 · SQLAlchemy
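
A minimal connection sketch under assumed settings (the server name, the database name ChemInventory, and Windows authentication are placeholders); the real connection configuration is not part of this page.

from sqlalchemy import create_engine, text

# ODBC Driver 17 connection string; server, database, and auth mode are assumptions
engine = create_engine(
    "mssql+pyodbc://@server/ChemInventory"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

def run_query(sql: str) -> list[dict]:
    # Execute the generated (read-only) query and return rows as dictionaries
    with engine.connect() as conn:
        result = conn.execute(text(sql))
        return [dict(row._mapping) for row in result]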

Training Pipeline

infrastructure

End-to-end fine-tuning workflow: multi-source dataset aggregation, tokenization with schema concatenation, Trainer API with AdamW optimizer, ROUGE evaluation, and model checkpointing.

Hugging Face Trainer · Datasets · ROUGE · evaluate
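
A sketch of the tokenization-with-schema-concatenation step described above, assuming each example exposes question, context (CREATE TABLE text), and answer fields as in the public text-to-SQL datasets; the field names and length limits are assumptions.

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

def tokenize_example(example):
    # Concatenate the schema with the question, mirroring the inference-time prompt
    prompt = f"Tables:\n{example['context']}\n\nQuestion:\n{example['question']}\n\nAnswer:\n"
    model_inputs = tokenizer(prompt, max_length=512, truncation=True)
    labels = tokenizer(text_target=example["answer"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Applied over the aggregated dataset, e.g. dataset.map(tokenize_example)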

Answer Generator

backend

Converts raw SQL query results into natural language responses. Formats tabular data, handles empty results gracefully, and provides context-aware explanations.

Protocol: Internal
LangChain · Prompt Templates
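
A hedged sketch of the answer-generation step using a LangChain prompt template; the wording, and the chat model behind it (llm below could be the same local Llama instance), are assumptions.

from langchain_core.prompts import ChatPromptTemplate

# Turns raw query results into a readable answer; the wording is illustrative only
answer_prompt = ChatPromptTemplate.from_template(
    "Question: {question}\n"
    "SQL query: {query}\n"
    "Query result: {result}\n\n"
    "Answer the question in plain English using only the result above. "
    "If the result is empty, say that no matching records were found."
)

def format_response(state):
    # 'llm' is assumed to be any LangChain chat model (e.g. the Llama fallback instance)
    message = (answer_prompt | llm).invoke(state)
    return {"answer": message.content}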

Data Flow

10 steps from input to output

How data flows through the system from client request to final response:

1. User → LangGraph Orchestrator (Input)

Natural language question: 'List all toxic chemicals that are also corrosive'

2. LangGraph Orchestrator → Schema Loader (File I/O)

Loads full 7-table schema context from context.txt

3. Schema Loader → T5 Generator (Internal)

Constructs prompt: 'Tables:\n{schema}\n\nQuestion:\n{question}\n\nAnswer:\n'

4. T5 Generator → LangGraph Orchestrator (Tensor → String)

Returns generated SQL query with JOIN and WHERE clauses

5. LangGraph Orchestrator → SQL Server (TDS)

Executes generated SQL via SQLAlchemy/pyodbc connection

6. SQL Server → LangGraph Orchestrator (TDS)

Returns query results or execution error

7. LangGraph Orchestrator → Fallback LLM (HTTP)

On error: sends original question + error message for SQL regeneration

8. Fallback LLM → SQL Server (TDS)

Retries execution with corrected SQL query

9. LangGraph Orchestrator → Answer Generator (Internal)

Passes successful query results for natural language formatting

10. Answer Generator → User (Output)

Returns formatted response: 'Found 23 chemicals that are both toxic and corrosive...'

Code Examples

4 ready-to-use snippets

Get started quickly with these practical code examples:

PYTHON

Natural Language Query

End-to-end query from natural language to results

# User asks a question in plain English
question = "List all chemicals that are both carcinogenic and flammable"

# LangGraph state machine processes the request
state = {
    "question": question,
    "query": "",      # Will be filled by T5
    "result": "",     # Will be filled by SQL execution
    "answer": ""      # Final natural language response
}

# Execute the compiled workflow ('graph' is built in the state machine snippet below)
result = graph.invoke(state)
print(result["answer"])
# Output: "Found 12 chemicals matching your criteria:
#          1. Benzene (CAS: 71-43-2) - Location: Chemistry Building, Room 301
#          2. Formaldehyde (CAS: 50-00-0) - Location: Biology Lab, Room 105
#          ..."

TEXT

Schema-Aware Prompt Template

Prompt structure for T5 fine-tuning with full schema context

Tables:
CREATE TABLE Chem_Chemical (
  ChemicalId INT PRIMARY KEY,
  Name NVARCHAR(255),
  CASNumber NVARCHAR(50),
  is_reviewed BIT,
  is_active BIT
);

CREATE TABLE Chem_Hazard (
  HazardId INT PRIMARY KEY,
  ChemicalId INT FOREIGN KEY,
  is_toxic BIT,
  is_carcinogen BIT,
  is_flammable BIT,
  is_corrosive BIT,
  NFPA_Health INT,
  NFPA_Flammability INT,
  -- 30+ additional hazard flags...
);

Question:
Find chemicals with NFPA health rating above 3 that haven't been reviewed

Answer:
SELECT c.Name, c.CASNumber, h.NFPA_Health
FROM Chem_Chemical c
JOIN Chem_Hazard h ON c.ChemicalId = h.ChemicalId
WHERE h.NFPA_Health > 3 AND c.is_reviewed = 0

PYTHON

LangGraph State Machine

State machine definition with fallback logic

from langgraph.graph import StateGraph, END
from typing import TypedDict

class QueryState(TypedDict):
    question: str
    query: str
    result: str
    answer: str

# Define the workflow graph
workflow = StateGraph(QueryState)

# Add nodes (the node functions below are implemented elsewhere in the project)
workflow.add_node("run_finetuned_llm", generate_sql_with_t5)
workflow.add_node("execute_query", run_sql_on_database)
workflow.add_node("fall_back", retry_with_llama)
workflow.add_node("generate_answer", format_response)

# Define edges with conditional routing
workflow.set_entry_point("run_finetuned_llm")
workflow.add_edge("run_finetuned_llm", "execute_query")
workflow.add_conditional_edges(
    "execute_query",
    lambda x: "fall_back" if "error" in x["result"] else "generate_answer"
)
workflow.add_edge("fall_back", "execute_query")
workflow.add_edge("generate_answer", END)

graph = workflow.compile()

PYTHON

Training Configuration

Fine-tuning setup with Hugging Face Trainer

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./finetuned_model_2_epoch",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-3,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=50,
    save_strategy="steps",   # must match evaluation_strategy when load_best_model_at_end=True
    save_steps=500,
    load_best_model_at_end=True,
    bf16=True,  # bfloat16 for memory efficiency
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_rouge_metrics,
)

# Train and evaluate
trainer.train()
# Results: ROUGE-1: 0.996, ROUGE-L: 0.996

Infrastructure

Production-ready deployment stack

Core infrastructure components behind the system:

Model Training Pipeline

deployment

End-to-end fine-tuning infrastructure: multi-source dataset aggregation from Hugging Face Hub, tokenization with schema concatenation, Trainer API with AdamW optimizer, and ROUGE-based evaluation. Supports distributed training and model checkpointing.

Hugging Face Trainer · Datasets · evaluate · ROUGE · bfloat16

LangGraph Orchestration

gateway

State machine framework managing the inference pipeline. Handles state transitions between SQL generation, execution, fallback, and answer formatting. Supports conditional routing based on query success/failure.

LangGraph · LangChain · TypedDict · StateGraph

Database Integration

storage

SQL Server connectivity via pyodbc with ODBC Driver 17. SQLAlchemy handles query execution and connection pooling. Supports the full 7-table schema with 30+ hazard classification columns.

SQL Server · pyodbc · SQLAlchemy · ODBC Driver 17

Multi-Source Dataset Pipeline

monitoring

Aggregates training data from 3 public Hugging Face datasets (b-mc2/sql-create-context, Clinton/Text-to-sql-v1, knowrohit07/know_sql) plus 100+ curated domain-specific QA pairs. 80/10/10 train/val/test split.

Hugging Face Datasets · JSON · Train/Val/Test Split
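
A sketch of how the aggregation and 80/10/10 split might be wired up with the datasets library; only one public source is loaded explicitly below, and the local file name chem_inventory_qa.json is a placeholder, since column names and file layout differ across sources.

from datasets import load_dataset, concatenate_datasets

# One of the public text-to-SQL sources named above (question/context/answer columns);
# Clinton/Text-to-sql-v1 and knowrohit07/know_sql would be loaded and renamed to the
# same columns before concatenation.
public = load_dataset("b-mc2/sql-create-context", split="train")
domain = load_dataset("json", data_files="chem_inventory_qa.json", split="train")

combined = concatenate_datasets([public, domain]).shuffle(seed=42)

# 80/10/10 train/val/test split via two successive splits
split = combined.train_test_split(test_size=0.2, seed=42)
val_test = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], val_test["train"], val_test["test"]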

Key Features

  • Fine-tuned T5-small (60M params) with bfloat16 precision for efficient GPU inference
  • 19x ROUGE score improvement: 0.05 (zero-shot baseline) → 0.996 (fine-tuned) on domain queries
  • Multi-source training data: 3 public SQL datasets + 100+ curated chemical inventory QA pairs
  • LangGraph state machine with 4 nodes: generate → execute → fallback → answer
  • Intelligent fallback to Llama 3.2-3b-instruct when primary SQL generation fails execution
  • Schema-aware prompting with full 7-table context (Chem_Chemical, Chem_Hazard, Chem_Item, Room_Inventory, Building)
  • 30+ hazard flag support: carcinogen, corrosive, flammable, radioactive, NFPA ratings, and more
  • ~5 second end-to-end latency (generation + SQL execution + natural language response)
  • Complex query support: JOINs, aggregations, window functions, multi-condition filters

Challenges Solved

  • Creating high-quality domain-specific training data for chemical safety terminology and complex hazard relationships
  • Balancing schema context size (7 tables, 30+ hazard columns) against tokenization limits (512 tokens)
  • Handling ambiguous natural language that could map to multiple valid SQL interpretations
  • Ensuring generated SQL is safe: read-only queries, no destructive operations, injection-resistant (a simple guard of this kind is sketched after this list)
  • Designing fallback logic that recovers gracefully from execution errors without user intervention
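
The read-only constraint in the list above is the kind of rule that can be enforced with a simple guard before execution; this is a minimal sketch, not the project's actual validation logic.

import re

# Keywords that indicate writes or schema changes
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|EXEC|MERGE|GRANT)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    # Accept a single SELECT (or CTE) statement; reject anything else
    statement = sql.strip().rstrip(";")
    if ";" in statement:  # multiple statements -> reject
        return False
    if not statement.upper().startswith(("SELECT", "WITH")):
        return False
    return not FORBIDDEN.search(statement)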

Outcomes & Impact

  • Enabled non-technical lab staff to query complex data: 'List reviewed chemicals with multiple health risk indicators'
  • ~95% query accuracy on held-out chemical inventory questions with syntactically correct SQL
  • Reduced bottleneck for compliance officers accessing hazard and location data for safety audits
  • Demonstrated effective domain adaptation: public SQL datasets + focused domain data = specialized model
  • Production-ready architecture with automatic retry, error handling, and natural language response generation