Abstract
As AI developers, we’re always looking for ways to make data more accessible and queryable through natural language. While Retrieval-Augmented Generation (RAG) has revolutionized how we interact with unstructured textual documents, it falls short when dealing with structured data. The approach is so powerful that users, and even early-stage AI developers, may fall under the illusion that it can be applied to any kind of data, including structured data like Excel files. That misconception leads to frustration and inefficiency. Excel is one of the most ubiquitous file formats across organizations, and it is structured, or at least “semi-structured”. Anyone who has tried to process an Excel file with the standard RAG approach quickly realizes there is no real value in treating Excel files the same way as PDFs.
In this post, I’ll share how I built a system that combines some prompting techniques to create a powerful Excel analysis tool based on SQL.
All the code is available on GitHub.
Why RAG Doesn’t Work with Excel Files
RAG was designed to enrich the LLM prompt with unstructured text retrieved from a large corpus of documents. Meaningful text chunks are identified through semantic similarity, and that is exactly why it struggles with Excel data:
1. The RAG Architecture
- The RAG architecture is built for unstructured text, where semantic similarity is key, but Excel data requires exact matches, aggregations, and complex relationships that semantic search simply can’t provide.
- RAG treats everything as flat text, missing essential contextual clues like column headers, data types, and relationships
- The chunking process in RAG breaks down the inherent structure of Excel files, leading to loss of critical information.
2. The nature of data in Excel files
- Excel files are inherently structured or semi-structured, with rows and columns that define relationships between data points.
- Most likely, the user’s intent is not to extract information in a “narrative” way, but rather to perform calculations, aggregations, and statistical analyses.
- In structured data, column headers, data types, and relationships between tables are critical; flattening them into prose discards exactly the context needed to answer questions correctly.
The Solution: LLM-Powered Excel-to-SQL Pipeline
LLM capabilities are growing rapidly, and code generation, especially SQL generation, is now one of their strongest areas. So instead of trying to force RAG onto structured data, I built a system that embraces the structured nature of Excel files and uses an LLM to convert the Excel data into a SQL database schema. This lets us leverage the full power of SQL for querying and analyzing the data.
Here’s the architecture:
For the database I used SQLite for simplicity, but this architecture can be adapted to any SQL database. For the LLM I used OpenAI’s gpt-4.1-mini, though any comparable model works.
System Components
The pipeline consists of four main components:
1. Metadata Analyzer
Uses an LLM to analyze sheet names and column headers, inferring the purpose and structure of the data:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Assumed here: a small Pydantic model whose fields mirror the
# information requested in the prompt below
class SheetMetadata(BaseModel):
    suggested_table_name: str = Field(description="snake_case English table name")
    description: str = Field(description="description of the table content")
    primary_key_column_name_if_present: str | None = None
    category: str = Field(description="e.g., financial, sales, inventory")

metadata_parser = PydanticOutputParser(pydantic_object=SheetMetadata)

metadata_prompt = PromptTemplate(
    input_variables=["sheet_name", "columns"],
    partial_variables={
        "format_instructions": metadata_parser.get_format_instructions()
    },
    template="""
Analyze the following metadata of an Excel sheet:
Sheet name: {sheet_name}
Columns: {columns}
Based on the column names, please return the following information:
- suggested_table_name
- description of the table content
- primary_key_column_name_if_present
- category (e.g., financial, sales, inventory, etc.)
{format_instructions}
Use table names in snake_case and in English.
""",
)
2. Type Detection Engine
Combines LLM analysis with statistical sampling to determine the correct SQL data types:
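Here is a minimal sketch of the statistical half (the function name and heuristics are illustrative; the real engine also folds in the LLM’s suggestion):

import pandas as pd

def infer_sql_type(series: pd.Series, sample_size: int = 100) -> str:
    # Sample non-null values and test them against numeric patterns
    sample = series.dropna().astype(str).str.strip().head(sample_size)
    if sample.empty:
        return "TEXT"
    # Percentages like "1.47%" become REAL once the symbol is stripped
    cleaned = sample.str.rstrip("%").str.replace(",", "", regex=False)
    numeric = pd.to_numeric(cleaned, errors="coerce")
    if numeric.notna().all():
        return "INTEGER" if (numeric % 1 == 0).all() else "REAL"
    return "TEXT"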
At the end of this process, the system generates a SQL schema that accurately represents the data types and relationships in the Excel file and executes it to create the table in the database.
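That last step is mechanical once the types are known; something along these lines (names are illustrative, not the exact repo code):

import sqlite3

def build_ddl(table_name: str, column_types: dict) -> str:
    # column_types: {"id": "INTEGER NOT NULL", "name": "TEXT NOT NULL", ...}
    cols = ",\n    ".join(f"{col} {sqltype}" for col, sqltype in column_types.items())
    return f"CREATE TABLE {table_name} (\n    {cols}\n);"

# Creating the table is then a single call:
# sqlite3.connect("portfolio.db").executescript(build_ddl("securities_list", types))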
3. SQL Generator
Converts natural language questions into SQL queries using the database schema as context:
sql_prompt = PromptTemplate(
    input_variables=["question", "schema", "sample_data"],
    template="""
Generate an SQL query to answer the following question:
Question: {question}
Database schema:
{schema}
Sample data:
{sample_data}
Generate ONLY the SQL query without any additional explanations.
Use standard SQLite syntax.
""",
)
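Wiring the prompt to a model looks roughly like this (a minimal sketch assuming LangChain’s ChatOpenAI wrapper; the helper name is mine):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

def generate_sql(question: str, schema: str, sample_data: str) -> str:
    # Prompt -> LLM -> raw SQL string, ready for the executor below
    response = (sql_prompt | llm).invoke(
        {"question": question, "schema": schema, "sample_data": sample_data}
    )
    return response.content.strip()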
4. Query Executor
Executes the generated SQL and formats results for presentation.
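A minimal executor sketch with sqlite3 and pandas (again, names are mine):

import sqlite3
import pandas as pd

def execute_query(db_path: str, sql: str) -> pd.DataFrame:
    # Run the generated SQL and hand back a DataFrame for display
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(sql, conn)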
Real-World Example: ETF Portfolio Analysis
Let me walk you through a concrete example using an Xtrackers ETF holdings composition as the example file. The Excel file is pretty simple: it contains the breakdown of the “Xtrackers MSCI World ex USA UCITS” ETF, listing the underlying stocks with their market value, weight in the portfolio, and sector classification.
Input Excel File Structure
ID | Name | ISIN | Country | Currency | Exchange | Type of Security | Rating | Industry Classification | Weighting |
---|---|---|---|---|---|---|---|---|---|
1 | SAP | DE0007164600 | Germany | EUR | XETRA | Equity | - | Information Technology | 1.47% |
2 | ASML HOLDING NV | NL0010273215 | Netherlands | EUR | Euronext Amsterdam | Equity | Baa2 | Information Technology | 1.46% |
3 | NESTLE SA | CH0038863350 | Switzerland | CHF | Scoach Switzerland | Equity | Aa2 | Consumer Staples | 1.22% |
4 | NOVARTIS AG | CH0012005267 | Switzerland | CHF | Scoach Switzerland | Equity | Aa3 | Health Care | 1.08% |
5 | ROCHE HOLDING PAR AG | CH0012032048 | Switzerland | CHF | Scoach Switzerland | Equity | A1 | Health Care | 1.06% |
6 | … | … | … | … | … | … | … | … | … |
System Processing Steps
Metadata Analysis
- LLM identifies the dataset as portfolio holdings data
- Suggests table name: securities_list
- Identifies ID as the primary key candidate
Type Detection
- ID: INTEGER (sequence number)
- Name: TEXT (company name)
- ISIN: TEXT (security identifier)
- Country: TEXT (country of origin)
- Currency: TEXT (currency of the security)
- Exchange: TEXT (trading exchange)
- Type of Security: TEXT (e.g., Equity, Bond)
- Rating: TEXT (credit rating)
- Industry Classification: TEXT (sector classification)
- Weighting: REAL (percentage weight in the portfolio)
SQL Schema Generation
Automatically generated DDL for the table:
CREATE TABLE securities_list (
    id INTEGER NOT NULL,
    name TEXT NOT NULL,
    isin TEXT NOT NULL,
    country TEXT,
    currency TEXT NOT NULL,
    exchange TEXT,
    type_of_security TEXT NOT NULL,
    rating TEXT,
    industry_classification TEXT,
    weighting REAL
);
Data Insertion
Here, the LLM automatically generates the SQL INSERT statements to populate the table with data from the Excel file (a sketch of the conversion logic follows the list):
- Handles format conversion (B for billions, % for percentages)
- Validates data integrity
- Inserts all holdings records
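The format-conversion step deserves a closer look; here is the kind of helper involved (the regexes and name are illustrative):

import re

MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def normalize_cell(raw):
    # Convert Excel display formats into SQL-friendly numbers;
    # anything that doesn't match passes through unchanged.
    value = str(raw).strip().replace(",", "")
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)%", value)
    if m:  # "1.47%" -> 0.0147, matching the weighting values above
        return float(m.group(1)) / 100
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)([KMBkmb])", value)
    if m:  # "2.5B" -> 2500000000.0
        return float(m.group(1)) * MULTIPLIERS[m.group(2).upper()]
    return raw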
Query Examples
Once processed, users can ask natural language questions:
Let’s start with a straightforward question:
Query: “How many rows are there in total?”
Generated SQL
SELECT COUNT(*) as N FROM securities_list;
Output
N |
---|
796 |
Ok, now let’s see a more complex query that requires aggregation and understanding of the data structure:
Query: “Can you show me the weight of the portfolio for each Country and Sector?”
Generated SQL
SELECT country, industry_classification AS sector, SUM(weighting) AS total_weight
FROM securities_list
GROUP BY country, industry_classification;
Output
country | sector | total_weight |
---|---|---|
- | unknown | 0.00349444 |
Australia | Communication Services | 0.00139149 |
Australia | Consumer Discretionary | 0.00439915 |
Australia | Consumer Staples | 0.00200737 |
Australia | Energy | 0.00214571 |
Australia | Financials | 0.0250382 |
Australia | Health Care | 0.00195356 |
Australia | Industrials | 0.00295084 |
Australia | Information Technology | 0.000675708 |
… | … | … |
Now, let’s take it a step further and add some WHERE conditions:
Query: “Show me the top 5 Non-European holdings by weight”
Generated SQL
SELECT name, country, weighting
FROM securities_list
WHERE country NOT IN ('Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom')
ORDER BY weighting DESC
LIMIT 5;
Output
name | country | weighting |
---|---|---|
COMMONWEALTH BANK OF AUSTRALIA | Australia | 0.00917773 |
ROYAL BANK OF CANADA | Canada | 0.00869856 |
TOYOTA MOTOR CORP | Japan | 0.00762924 |
MITSUBISHI UFJ FINANCIAL GROUP INC | Japan | 0.00730266 |
SONY GROUP CORP | Japan | 0.00721053 |
Really not bad!
Results
I haven’t performed an exhaustive analysis, but I ran several spot checks and the results were almost always accurate. Even though I built this system in just a few hours, it proved highly effective:
- Processed all rows of the Excel file on the first run
- Correctly identified and converted data types
- Generated accurate SQL for complex queries
- Provided instant answers to portfolio analysis questions
Implementation Highlights
Smart Type Detection
The system uses a two-stage approach:
- LLM Analysis: Understands context and business meaning
- Statistical Validation: Confirms patterns in actual data
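Put together, and reusing the infer_sql_type sketch from earlier (the reconciliation rule here is my assumption about how ties should break):

def detect_column_type(series, llm_suggestion: str) -> str:
    # Stage 1 gives context ("Weighting" sounds numeric); stage 2 checks
    # whether the actual values back that up.
    statistical = infer_sql_type(series)
    if llm_suggestion in ("INTEGER", "REAL") and statistical == "TEXT":
        return "TEXT"  # the data contradicts the LLM: fall back to the safe type
    return llm_suggestion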
Robust Data Conversion
Handles common Excel formatting issues:
- Currency symbols and abbreviations (K, M, B)
- Percentage formatting
- Date variations
- Empty cells and data validation
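For the date variations, one simple approach is to try a list of common renderings (the format list is illustrative):

from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d.%m.%Y")

def parse_date(value: str):
    # Normalize assorted Excel date renderings to ISO 8601 for SQLite
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review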
Context-Aware SQL Generation
Within the prompt, the LLM receives:
- Complete database schema
- Sample data for context
- Column relationships
- Previous successful queries (for learning)
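The first two ingredients are easy to gather from SQLite itself; a sketch (the function name is mine):

import sqlite3

def build_prompt_context(db_path: str, table: str, n_rows: int = 3):
    # Fetch the table's DDL and a few real rows to ground the LLM's SQL
    with sqlite3.connect(db_path) as conn:
        ddl = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
            (table,),
        ).fetchone()[0]
        rows = conn.execute(f"SELECT * FROM {table} LIMIT {n_rows}").fetchall()
    return ddl, rows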
Limitations
While this approach is powerful, it has some limitations:
- Effectiveness: it’s highly dependent on the quality of the Excel file. The example used in this post is a pretty standard “table-like” file, but more complicated structures are common: pivot tables, merged cells, or complex formulas that may not be easily interpretable. To get the best results, the Excel file should be cleaned up before processing.
- LLM Limitations: The LLM’s ability to understand complex queries is still evolving. It may struggle with highly technical or domain-specific questions.
- Performance: For very large Excel files, the initial analysis and SQL generation may take time. However, once the schema is established, queries are fast.
- Data Integrity: The system assumes the Excel data is clean and well-structured. If the data contains errors or inconsistencies, it may lead to incorrect SQL generation or results.
Conclusions and Future Evolution
This approach solves the fundamental mismatch between RAG and structured data by:
- Preserving data relationships and structure
- Enabling complex analytical queries
- Providing exact, calculated results
- Maintaining data integrity and types
Potential Enhancements
1. Multi-Table Relationships
To handle more complex Excel files with multiple sheets, the system could be extended to manage relationships between multiple tables. For example, if the Excel file contains one sheet with holdings and another with a sector mapping, the system could automatically generate the necessary JOINs for richer analysis.
2. Advanced Analytics Integration
- Statistical functions (correlation, regression)
- Time series analysis for historical data
- Machine learning model integration
3. Visualization Pipeline
# Future enhancement: Auto-generate charts
def generate_visualization(query_result, question):
    # Analyze result structure
    # Choose appropriate chart type
    # Generate visualization code
    pass
4. Multi-Format Support
- Google Sheets integration
- CSV batch processing
- Database export compatibility
5. Query Optimization
- Query caching and reuse
- Index suggestions
- Performance monitoring
Business Applications
This system opens up numerous use cases:
- Financial Analysis: Portfolio composition, risk assessment
- Sales Analytics: Performance tracking, trend analysis
- Inventory Management: Stock levels, demand forecasting
- HR Analytics: Workforce composition, performance metrics
- Marketing Analytics: Campaign performance, customer segmentation
The key insight is that structured data needs structured solutions. By combining the natural language understanding of LLMs with the precise capabilities of SQL, we can create powerful tools that make complex data analysis accessible to everyone.