Abstract
As AI developers, we’re always looking for ways to make data more accessible and queryable through natural language. While Retrieval-Augmented Generation (RAG) has revolutionized how we interact with unstructured textual documents, it falls short when dealing with structured data. The approach is so powerful that users, and even early-stage AI developers, may fall under the illusion that it can be applied to any kind of data, including structured data like Excel files. That misconception leads to frustration and inefficiency. Excel is one of the most ubiquitous file formats across organizations, and it is structured, or at least “semi-structured”. Anyone who has tried to process an Excel file with the standard RAG approach quickly realizes there is no real value in treating Excel files the same way as PDFs.
In this post, I’ll share how I built a system that combines some prompting techniques to create a powerful Excel analysis tool based on SQL.
All the code is available on GitHub.
Why RAG Doesn’t Work with Excel Files
RAG was designed to enrich the LLM prompt with unstructured text retrieved from a large corpus of documents. Meaningful text chunks are identified through semantic similarity, and that is exactly why it struggles with Excel data:
1. The RAG Architecture
- The RAG architecture is built for unstructured text, where semantic similarity is key, but Excel data requires exact matches, aggregations, and complex relationships that semantic search simply can’t provide.
- RAG treats everything as flat text, missing essential contextual clues like column headers, data types, and relationships
- The chunking process in RAG breaks down the inherent structure of Excel files, leading to loss of critical information.
2. The nature of data in Excel files
- Excel files are inherently structured or semi-structured, with rows and columns that define relationships between data points.
- Most likely, the user’s intent is not to extract information in a “narrative” way, but rather to perform calculations, aggregations, and statistical analyses.
- In structured data, column headers, data types, and relationships between tables are critical; flattening them into prose discards exactly the context needed to answer questions correctly.
The Solution: LLM-Powered Excel-to-SQL Pipeline
LLM capabilities are growing rapidly, and code generation, especially SQL generation, is now one of their strongest areas. So instead of trying to force RAG onto structured data, I built a system that embraces the structured nature of Excel files and uses an LLM to convert the Excel data into a SQL database schema. This lets us leverage the full power of SQL for querying and analyzing the data.
Here’s the architecture:
For the database I used SQLite for simplicity, but this architecture can be adapted to any SQL database. For the LLM I used OpenAI’s gpt-4.1-mini, though any comparable model works.
System Components
The pipeline consists of four main components:
1. Metadata Analyzer
Uses an LLM to analyze sheet names and column headers, inferring the purpose and structure of the data:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Assumed here: a small Pydantic model whose fields mirror the
# information requested in the prompt below
class SheetMetadata(BaseModel):
    suggested_table_name: str = Field(description="snake_case English table name")
    description: str = Field(description="description of the table content")
    primary_key_column_name_if_present: str | None = None
    category: str = Field(description="e.g., financial, sales, inventory")

metadata_parser = PydanticOutputParser(pydantic_object=SheetMetadata)

metadata_prompt = PromptTemplate(
    input_variables=["sheet_name", "columns"],
    partial_variables={
        "format_instructions": metadata_parser.get_format_instructions()
    },
    template="""
Analyze the following metadata of an Excel sheet:
Sheet name: {sheet_name}
Columns: {columns}
Based on the column names, please return the following information:
- suggested_table_name
- description of the table content
- primary_key_column_name_if_present
- category (e.g., financial, sales, inventory, etc.)
{format_instructions}
Use table names in snake_case and in English.
""",
)
2. Type Detection Engine
Combines LLM analysis with statistical sampling to determine the correct SQL data types:
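Here is a minimal sketch of the statistical half (the function name and heuristics are illustrative; the real engine also folds in the LLM’s suggestion):

import pandas as pd

def infer_sql_type(series: pd.Series, sample_size: int = 100) -> str:
    # Sample non-null values and test them against numeric patterns
    sample = series.dropna().astype(str).str.strip().head(sample_size)
    if sample.empty:
        return "TEXT"
    # Percentages like "1.47%" become REAL once the symbol is stripped
    cleaned = sample.str.rstrip("%").str.replace(",", "", regex=False)
    numeric = pd.to_numeric(cleaned, errors="coerce")
    if numeric.notna().all():
        return "INTEGER" if (numeric % 1 == 0).all() else "REAL"
    return "TEXT"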
At the end of this process, the system generates a SQL schema that accurately represents the data types and relationships in the Excel file and executes it to create the table in the database.
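That last step is mechanical once the types are known; something along these lines (names are illustrative, not the exact repo code):

import sqlite3

def build_ddl(table_name: str, column_types: dict) -> str:
    # column_types: {"id": "INTEGER NOT NULL", "name": "TEXT NOT NULL", ...}
    cols = ",\n    ".join(f"{col} {sqltype}" for col, sqltype in column_types.items())
    return f"CREATE TABLE {table_name} (\n    {cols}\n);"

# Creating the table is then a single call:
# sqlite3.connect("portfolio.db").executescript(build_ddl("securities_list", types))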
3. SQL Generator
Converts natural language questions into SQL queries using the database schema as context:
sql_prompt = PromptTemplate(
    input_variables=["question", "schema", "sample_data"],
    template="""
Generate an SQL query to answer the following question:
Question: {question}
Database schema:
{schema}
Sample data:
{sample_data}
Generate ONLY the SQL query without any additional explanations.
Use standard SQLite syntax.
""",
)
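Wiring the prompt to a model looks roughly like this (a minimal sketch assuming LangChain’s ChatOpenAI wrapper; the helper name is mine):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

def generate_sql(question: str, schema: str, sample_data: str) -> str:
    # Prompt -> LLM -> raw SQL string, ready for the executor below
    response = (sql_prompt | llm).invoke(
        {"question": question, "schema": schema, "sample_data": sample_data}
    )
    return response.content.strip()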
4. Query Executor
Executes the generated SQL and formats results for presentation.
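A minimal executor sketch with sqlite3 and pandas (again, names are mine):

import sqlite3
import pandas as pd

def execute_query(db_path: str, sql: str) -> pd.DataFrame:
    # Run the generated SQL and hand back a DataFrame for display
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(sql, conn)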
Real-World Example: ETF Portfolio Analysis
Let me walk you through a concrete example using an Xtrackers ETF holdings composition as the example file. The Excel file is pretty simple: it contains the breakdown of the “Xtrackers MSCI World ex USA UCITS” ETF, listing the underlying stocks with their market value, weight in the portfolio, and sector classification.
Input Excel File Structure
ID | Name | ISIN | Country | Currency | Exchange | Type of Security | Rating | Industry Classification | Weighting |
---|---|---|---|---|---|---|---|---|---|
1 | SAP | DE0007164600 | Germany | EUR | XETRA | Equity | - | Information Technology | 1.47% |
2 | ASML HOLDING NV | NL0010273215 | Netherlands | EUR | Euronext Amsterdam | Equity | Baa2 | Information Technology | 1.46% |
3 | NESTLE SA | CH0038863350 | Switzerland | CHF | Scoach Switzerland | Equity | Aa2 | Consumer Staples | 1.22% |
4 | NOVARTIS AG | CH0012005267 | Switzerland | CHF | Scoach Switzerland | Equity | Aa3 | Health Care | 1.08% |
5 | ROCHE HOLDING PAR AG | CH0012032048 | Switzerland | CHF | Scoach Switzerland | Equity | A1 | Health Care | 1.06% |
6 | … | … | … | … | … | … | … | … | … |
System Processing Steps
Metadata Analysis
- LLM identifies the dataset as portfolio holdings data
- Suggests table name: securities_list
- Identifies ID as the primary key candidate
Type Detection
- ID: INTEGER (sequence number)
- Name: TEXT (company name)
- ISIN: TEXT (security identifier)
- Country: TEXT (country of origin)
- Currency: TEXT (currency of the security)
- Exchange: TEXT (trading exchange)
- Type of Security: TEXT (e.g., Equity, Bond)
- Rating: TEXT (credit rating)
- Industry Classification: TEXT (sector classification)
- Weighting: REAL (percentage weight in the portfolio)
SQL Schema Generation
Automatically generated DDL for the table:
CREATE TABLE securities_list (
    id INTEGER NOT NULL,
    name TEXT NOT NULL,
    isin TEXT NOT NULL,
    country TEXT,
    currency TEXT NOT NULL,
    exchange TEXT,
    type_of_security TEXT NOT NULL,
    rating TEXT,
    industry_classification TEXT,
    weighting REAL
);
Data Insertion
Here, the LLM automatically generates the SQL INSERT statements to populate the table with data from the Excel file (a sketch of the conversion logic follows the list):
- Handles format conversion (B for billions, % for percentages)
- Validates data integrity
- Inserts all holdings records
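The format-conversion step deserves a closer look; here is the kind of helper involved (the regexes and name are illustrative):

import re

MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def normalize_cell(raw):
    # Convert Excel display formats into SQL-friendly numbers;
    # anything that doesn't match passes through unchanged.
    value = str(raw).strip().replace(",", "")
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)%", value)
    if m:  # "1.47%" -> 0.0147, matching the weighting values above
        return float(m.group(1)) / 100
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)([KMBkmb])", value)
    if m:  # "2.5B" -> 2500000000.0
        return float(m.group(1)) * MULTIPLIERS[m.group(2).upper()]
    return raw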
Query Examples
Once processed, users can ask natural language questions:
Let’s start with a straightforward question:
Query: “How many rows are there in total?”
Generated SQL
SELECT COUNT(*) as N FROM securities_list;
Output
N |
---|
796 |
Ok, now let’s see a more complex query that requires aggregation and understanding of the data structure:
Query: “Can you show me the weight of the portfolio for each Country and Sector?”
Generated SQL
SELECT country, industry_classification AS sector, SUM(weighting) AS total_weight
FROM securities_list
GROUP BY country, industry_classification;
Output
country | sector | total_weight |
---|---|---|
- | unknown | 0.00349444 |
Australia | Communication Services | 0.00139149 |
Australia | Consumer Discretionary | 0.00439915 |
Australia | Consumer Staples | 0.00200737 |
Australia | Energy | 0.00214571 |
Australia | Financials | 0.0250382 |
Australia | Health Care | 0.00195356 |
Australia | Industrials | 0.00295084 |
Australia | Information Technology | 0.000675708 |
… | … | … |
Now, let’s take it a step further and add some WHERE conditions:
Query: “Show me the top 5 Non-European holdings by weight”
Generated SQL
SELECT name, country, weighting
FROM securities_list
WHERE country NOT IN ('Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom')
ORDER BY weighting DESC
LIMIT 5;
Output
name | country | weighting |
---|---|---|
COMMONWEALTH BANK OF AUSTRALIA | Australia | 0.00917773 |
ROYAL BANK OF CANADA | Canada | 0.00869856 |
TOYOTA MOTOR CORP | Japan | 0.00762924 |
MITSUBISHI UFJ FINANCIAL GROUP INC | Japan | 0.00730266 |
SONY GROUP CORP | Japan | 0.00721053 |
Really not bad!
Results
I haven’t performed an exhaustive analysis, but I ran several spot checks and the results were almost always accurate. Even though I built this system in just a few hours, it proved highly effective:
- Processed all rows of the Excel file on the first run
- Correctly identified and converted data types
- Generated accurate SQL for complex queries
- Provided instant answers to portfolio analysis questions
Implementation Highlights
Smart Type Detection
The system uses a two-stage approach:
- LLM Analysis: Understands context and business meaning
- Statistical Validation: Confirms patterns in actual data
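Put together, and reusing the infer_sql_type sketch from earlier (the reconciliation rule here is my assumption about how ties should break):

def detect_column_type(series, llm_suggestion: str) -> str:
    # Stage 1 gives context ("Weighting" sounds numeric); stage 2 checks
    # whether the actual values back that up.
    statistical = infer_sql_type(series)
    if llm_suggestion in ("INTEGER", "REAL") and statistical == "TEXT":
        return "TEXT"  # the data contradicts the LLM: fall back to the safe type
    return llm_suggestion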
Robust Data Conversion
Handles common Excel formatting issues:
- Currency symbols and abbreviations (K, M, B)
- Percentage formatting
- Date variations
- Empty cells and data validation
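For the date variations, one simple approach is to try a list of common renderings (the format list is illustrative):

from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d.%m.%Y")

def parse_date(value: str):
    # Normalize assorted Excel date renderings to ISO 8601 for SQLite
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review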
Context-Aware SQL Generation
Within the prompt, the LLM receives:
- Complete database schema
- Sample data for context
- Column relationships
- Previous successful queries (for learning)
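The first two ingredients are easy to gather from SQLite itself; a sketch (the function name is mine):

import sqlite3

def build_prompt_context(db_path: str, table: str, n_rows: int = 3):
    # Fetch the table's DDL and a few real rows to ground the LLM's SQL
    with sqlite3.connect(db_path) as conn:
        ddl = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
            (table,),
        ).fetchone()[0]
        rows = conn.execute(f"SELECT * FROM {table} LIMIT {n_rows}").fetchall()
    return ddl, rows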
Limitations
While this approach is powerful, it has some limitations:
- Effectiveness: it’s highly dependent on the quality of the Excel file. The example used in this post is a pretty standard “table-like” file, but more complicated structures are common: pivot tables, merged cells, or complex formulas that may not be easily interpretable. To get the best results, the Excel file should be cleaned up before processing.
- LLM Limitations: The LLM’s ability to understand complex queries is still evolving. It may struggle with highly technical or domain-specific questions.
- Performance: For very large Excel files, the initial analysis and SQL generation may take time. However, once the schema is established, queries are fast.
- Data Integrity: The system assumes the Excel data is clean and well-structured. If the data contains errors or inconsistencies, it may lead to incorrect SQL generation or results.
Conclusions and Future Evolution
This approach solves the fundamental mismatch between RAG and structured data by:
- Preserving data relationships and structure
- Enabling complex analytical queries
- Providing exact, calculated results
- Maintaining data integrity and types
Potential Enhancements
1. Multi-Table Relationships
To handle more complex Excel files with multiple sheets, the system could be extended to manage relationships between multiple tables. For example, if the Excel file contains one sheet with holdings and another with a sector mapping, the system could automatically generate the necessary JOINs for richer analysis.
2. Advanced Analytics Integration
- Statistical functions (correlation, regression)
- Time series analysis for historical data
- Machine learning model integration
3. Visualization Pipeline
# Future enhancement: Auto-generate charts
def generate_visualization(query_result, question):
    # Analyze result structure
    # Choose appropriate chart type
    # Generate visualization code
    pass
4. Multi-Format Support
- Google Sheets integration
- CSV batch processing
- Database export compatibility
5. Query Optimization
- Query caching and reuse
- Index suggestions
- Performance monitoring
Business Applications
This system opens up numerous use cases:
- Financial Analysis: Portfolio composition, risk assessment
- Sales Analytics: Performance tracking, trend analysis
- Inventory Management: Stock levels, demand forecasting
- HR Analytics: Workforce composition, performance metrics
- Marketing Analytics: Campaign performance, customer segmentation
The key insight is that structured data needs structured solutions. By combining the natural language understanding of LLMs with the precise capabilities of SQL, we can create powerful tools that make complex data analysis accessible to everyone.