Abstract

As AI developers, we’re always looking for ways to make data more accessible and queryable through natural language. While Retrieval-Augmented Generation (RAG) has revolutionized how we interact with unstructured textual documents, it falls short when dealing with structured data. The approach is so powerful that users, and even early-stage AI developers, may fall into the illusion that it can be applied to any kind of data, including structured data like Excel files. That misconception leads to frustration and inefficiency. Excel is one of the most ubiquitous file formats across organizations, and its contents are structured, or at least “semi-structured”. Anyone who has tried to process an Excel file with a standard RAG pipeline quickly realizes there is no real value in treating Excel files the same way as PDFs.

In this post, I’ll share how I built a system that combines a few prompting techniques with a SQL backend to create a powerful Excel analysis tool.

All the code is available on GitHub

Why RAG Doesn’t Work with Excel Files

RAG was designed to enrich the LLM prompt with unstructured text retrieved from a large corpus of documents. Meaningful text chunks are identified through semantic similarity, and that is exactly why it struggles with Excel data:

1. The RAG Architecture

  • RAG Architecture is designed for unstructured text, where semantic similarity is key, but Excel data requires exact matches, aggregations, and complex relationships that semantic search simply can’t provide.
  • RAG treats everything as flat text, missing essential contextual clues like column headers, data types, and relationships.
  • The chunking process in RAG breaks down the inherent structure of Excel files, leading to loss of critical information.

2. The nature of data in Excel files

  • Excel files are inherently structured or semi-structured, with rows and columns that define relationships between data points.
  • Most likely, the user’s intent is not to extract information in a “narrative” way, but rather to perform calculations, aggregations, and statistical analyses.
  • In structured data, column headers, data types, and relationships between tables carry critical meaning; flattening them into plain text discards exactly the context an analytical query needs.

The Solution: LLM-Powered Excel-to-SQL Pipeline

LLM capabilities are growing rapidly, and code generation, especially SQL generation, is now one of their strongest areas. So, instead of trying to force RAG into a structured-data world, I’ve built a system that embraces the structured nature of Excel files and uses an LLM to convert the Excel data into a SQL database schema. This lets us leverage the full power of SQL for querying and analyzing the data.

Here’s the architecture:

graph TD;
    A[Excel File Upload] --> B[LLM Metadata Analysis];
    B --> C[Column Type Detection];
    C --> D[SQL Schema Generation];
    D --> E[Data Insertion];
    E --> F[Ready for Queries];
    G[Natural Language Query] --> H[LLM SQL Generation];
    H --> I[Query Execution];
    I --> J[Results & Visualization];
    F --> H;
    style A fill:#e1f5fe;
    style F fill:#e8f5e8;
    style J fill:#fff3e0;

For the database I used SQLite for simplicity, but the architecture can be adapted to any SQL engine. For the LLM I used OpenAI’s gpt-4.1-mini, but any comparable model will work.

System Components

The pipeline consists of four main components:

1. Metadata Analyzer

Uses an LLM to analyze sheet names and column headers, inferring the purpose and structure of the data:

from langchain_core.prompts import PromptTemplate

metadata_prompt = PromptTemplate(
    input_variables=["sheet_name", "columns"],
    partial_variables={
        "format_instructions": metadata_parser.get_format_instructions()
    },
    template="""
    Analyze the following metadata of an Excel sheet:
    
    Sheet name: {sheet_name}
    Columns: {columns}
    
    Based on the column names, please return the following information:
        - suggested_table_name
        - description of the table content
        - primary_key_column_name_if_present
        - category (e.g., financial, sales, inventory, etc.)

    {format_instructions}
    
    Use table names in snake_case and in English.
    """
)
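
For context, here is roughly how the pieces around this prompt can be wired up. Note that metadata_parser, which supplies {format_instructions} above, must be defined before the prompt. This is a sketch assuming LangChain’s PydanticOutputParser and langchain_openai; the field names mirror the prompt, and the sample inputs are illustrative:

# Sketch: define the parser *before* metadata_prompt, since it supplies
# {format_instructions}. Model choice and fields are illustrative.
from pydantic import BaseModel
from langchain_core.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI

class SheetMetadata(BaseModel):
    suggested_table_name: str
    description: str
    primary_key_column_name_if_present: str | None = None
    category: str

metadata_parser = PydanticOutputParser(pydantic_object=SheetMetadata)

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

# Prompt -> model -> parsed object
metadata = (metadata_prompt | llm | metadata_parser).invoke(
    {"sheet_name": "Holdings", "columns": "ID, Name, ISIN, Country, ..."}
)
print(metadata.suggested_table_name)  # e.g. "securities_list"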

2. Type Detection Engine

Combines LLM analysis with statistical sampling to determine the correct SQL data types:

graph LR
    A[Sample Data] --> B[LLM Analysis]
    A --> C[Statistical Analysis]
    B --> D[Final Type Decision]
    C --> D
    D --> E[SQL Schema]

At the end of this process, the system generates a SQL schema that accurately represents the data types and relationships in the Excel file and executes it to create the table in the database.
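
A minimal sketch of how the two signals can be combined (assuming pandas for the statistical pass; the coercion heuristics and the 100-row sample size are illustrative choices):

import pandas as pd

def infer_sql_type(series: pd.Series, llm_suggestion: str) -> str:
    """Combine a statistical pass over sampled values with the LLM's suggestion."""
    sample = series.dropna().head(100)

    # Statistical pass: try to coerce the sample to a numeric type
    numeric = pd.to_numeric(
        sample.astype(str).str.rstrip("%").str.replace(",", ""), errors="coerce"
    )
    if len(sample) > 0 and numeric.notna().all():
        return "INTEGER" if (numeric % 1 == 0).all() else "REAL"

    # Otherwise fall back to the LLM's contextual suggestion, defaulting to TEXT
    return llm_suggestion if llm_suggestion in {"TEXT", "INTEGER", "REAL", "DATE"} else "TEXT"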

3. SQL Generator

Converts natural language questions into SQL queries using the database schema as context:

sql_prompt = PromptTemplate(
    input_variables=["question", "schema", "sample_data"],
    template="""
    Generate an SQL query to answer the following question:
    
    Question: {question}
    
    Database schema:
    {schema}
    
    Sample data:
    {sample_data}
    
    Generate ONLY the SQL query without any additional explanations.
    Use standard SQLite syntax.
    """
)

4. Query Executor

Executes the generated SQL and formats results for presentation.
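
End to end, generation plus execution can look something like this (a sketch assuming LangChain, sqlite3, and pandas; the fence-stripping step guards against the model wrapping its answer in markdown despite the prompt’s instructions):

import re
import sqlite3
import pandas as pd
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

def answer_question(question: str, schema: str, sample_data: str, db_path: str) -> pd.DataFrame:
    # Ask the model for a query using the sql_prompt defined above
    sql = (sql_prompt | llm).invoke(
        {"question": question, "schema": schema, "sample_data": sample_data}
    ).content

    # Strip markdown fences the model sometimes adds anyway
    sql = re.sub(r"^```(?:sql)?|```$", "", sql.strip(), flags=re.MULTILINE).strip()

    # Execute and return a DataFrame for easy formatting
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(sql, conn)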

Real-World Example: ETF Portfolio Analysis

Let me walk you through a concrete example using an Xtrackers ETF holdings composition as the example file. The Excel file is fairly simple: it contains the breakdown of the “Xtrackers MSCI World ex USA UCITS” ETF, with the underlying stocks, their market value, weight in the portfolio, and sector classification.

Input Excel File Structure

| ID | Name | ISIN | Country | Currency | Exchange | Type of Security | Rating | Industry Classification | Weighting |
|---|---|---|---|---|---|---|---|---|---|
| 1 | SAP | DE0007164600 | Germany | EUR | XETRA | Equity | - | Information Technology | 1.47% |
| 2 | ASML HOLDING NV | NL0010273215 | Netherlands | EUR | Euronext Amsterdam | Equity | Baa2 | Information Technology | 1.46% |
| 3 | NESTLE SA | CH0038863350 | Switzerland | CHF | Scoach Switzerland | Equity | Aa2 | Consumer Staples | 1.22% |
| 4 | NOVARTIS AG | CH0012005267 | Switzerland | CHF | Scoach Switzerland | Equity | Aa3 | Health Care | 1.08% |
| 5 | ROCHE HOLDING PAR AG | CH0012032048 | Switzerland | CHF | Scoach Switzerland | Equity | A1 | Health Care | 1.06% |
| 6 | … | … | … | … | … | … | … | … | …% |

System Processing Steps

  1. Metadata Analysis

    • LLM identifies the dataset as portfolio holdings data
    • Suggests table name: securities_list
    • Identifies ID as primary key candidate
  2. Type Detection

    • ID: INTEGER (sequence number)
    • Name: TEXT (Company Name)
    • ISIN: TEXT (Security Identifier)
    • Country: TEXT (Country of origin)
    • Currency: TEXT (Currency of the security)
    • Exchange: TEXT (Trading exchange)
    • Type of Security: TEXT (e.g., Equity, Bond)
    • Rating: TEXT (Credit rating)
    • Industry Classification: TEXT (Sector classification)
    • Weighting: REAL (Percentage weight in the portfolio)
  3. SQL Schema Generation

     The automatically generated DDL for the table:

   CREATE TABLE securities_list (
       id INTEGER NOT NULL,
       name TEXT NOT NULL,
       isin TEXT NOT NULL,
       country TEXT,
       currency TEXT NOT NULL,
       exchange TEXT,
       type_of_security TEXT NOT NULL,
       rating TEXT,
       industry_classification TEXT,
       weighting REAL
   );
  4. Data Insertion

     Here, the LLM automatically generates the SQL INSERT statements to populate the table with data from the Excel file (a sketch of this step follows the list):
    • Handles format conversion (B for billions, % for percentages)
    • Validates data integrity
    • Inserts all holdings records
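
A minimal sketch of the insertion step (assuming pandas; the file name, the snake_case renaming, and the inline percentage normalization are illustrative):

import sqlite3
import pandas as pd

# Load the sheet (hypothetical file name)
df = pd.read_excel("holdings.xlsx")

# Rename headers to the snake_case columns of the generated schema
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Normalize formatted values, e.g. '1.47%' -> 0.0147
df["weighting"] = df["weighting"].astype(str).str.rstrip("%").astype(float) / 100

with sqlite3.connect("portfolio.db") as conn:
    df.to_sql("securities_list", conn, if_exists="append", index=False)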

Query Examples

Once processed, users can ask natural language questions:

Let’s start with a straightforward question:


Query: “How many rows are there in total?”

Generated SQL

SELECT COUNT(*) as N FROM securities_list;

Output

| N |
|---|
| 796 |

Ok, now let’s see a more complex query that requires aggregation and an understanding of the data structure:

Query: “Can you show me the weight of the portfolio for each Country and Sector?”

Generated SQL

SELECT country, industry_classification AS sector, SUM(weighting) AS total_weight
FROM securities_list
GROUP BY country, industry_classification;

Output

| country | industry_classification | total_weight |
|---|---|---|
| - | unknown | 0.00349444 |
| Australia | Communication Services | 0.00139149 |
| Australia | Consumer Discretionary | 0.00439915 |
| Australia | Consumer Staples | 0.00200737 |
| Australia | Energy | 0.00214571 |
| Australia | Financials | 0.0250382 |
| Australia | Health Care | 0.00195356 |
| Australia | Industrials | 0.00295084 |
| Australia | Information Technology | 0.000675708 |
| … | … | … |

Now, let’s take it a step further and apply some WHERE conditions:

Query: “Show me the top 5 non-European holdings by weight”

Generated SQL

SELECT name, country, weighting
FROM securities_list
WHERE country NOT IN ('Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom')
ORDER BY weighting DESC
LIMIT 5;

Output

| name | country | weighting |
|---|---|---|
| COMMONWEALTH BANK OF AUSTRALIA | Australia | 0.00917773 |
| ROYAL BANK OF CANADA | Canada | 0.00869856 |
| TOYOTA MOTOR CORP | Japan | 0.00762924 |
| MITSUBISHI UFJ FINANCIAL GROUP INC | Japan | 0.00730266 |
| SONY GROUP CORP | Japan | 0.00721053 |

Really not bad!

Results

I haven’t performed an exhaustive analysis, but I did run several spot checks, and the results were almost always accurate. Even though I developed this system in just a few hours, it has proven to be highly effective:

  • Processed all rows of the Excel file on the first run
  • Correctly identified and converted data types
  • Generated accurate SQL for complex queries
  • Provided instant answers to portfolio analysis questions

Implementation Highlights

Smart Type Detection

The system uses a two-stage approach:

  1. LLM Analysis: Understands context and business meaning
  2. Statistical Validation: Confirms patterns in actual data

Robust Data Conversion

Handles common Excel formatting issues:

  • Currency symbols and abbreviations (K, M, B)
  • Percentage formatting
  • Date variations
  • Empty cells and data validation
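
A normalization helper along these lines covers the percentage and K/M/B cases (a sketch; the exact rules should follow the quirks of your own files):

import re

MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def to_number(value):
    """Sketch: normalize common Excel formats.
    '1.47%' -> 0.0147, '$2.3B' -> 2.3e9, '' -> None."""
    text = str(value).strip().lstrip("$€£").replace(",", "")
    if not text or text.lower() in {"nan", "none", "-"}:
        return None
    if text.endswith("%"):
        return float(text[:-1]) / 100
    match = re.fullmatch(r"([\d.]+)\s*([KMB])", text, flags=re.IGNORECASE)
    if match:
        return float(match.group(1)) * MULTIPLIERS[match.group(2).upper()]
    try:
        return float(text)
    except ValueError:
        return None  # leave genuinely non-numeric values to TEXT columns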

Context-Aware SQL Generation

Within the prompt, the LLM receives (assembled as sketched after this list):

  • Complete database schema
  • Sample data for context
  • Column relationships
  • Previous successful queries (for learning)
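
Gathering the schema portion is straightforward with SQLite’s catalog (a sketch; sample rows can be collected similarly with a SELECT ... LIMIT per table):

import sqlite3

def build_schema_context(db_path: str) -> str:
    """Collect every table's DDL from SQLite so it can be pasted into the prompt."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
        ).fetchall()
    return "\n\n".join(sql for (sql,) in rows)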

Limitations

While this approach is powerful, it has some limitations:

  • Effectiveness: it’s highly dependent on the quality of the Excel file. The example used in this post is a fairly standard “table-like” file, but pivot tables, merged cells, and complex formulas are common in the wild and may not be easily interpretable. For best results, the Excel file should be cleaned up before processing.
  • LLM Limitations: The LLM’s ability to understand complex queries is still evolving. It may struggle with highly technical or domain-specific questions.
  • Performance: For very large Excel files, the initial analysis and SQL generation may take time. However, once the schema is established, queries are fast.
  • Data Integrity: The system assumes the Excel data is clean and well-structured. If the data contains errors or inconsistencies, it may lead to incorrect SQL generation or results.

Conclusions and Future Evolution

This approach solves the fundamental mismatch between RAG and structured data by:

  • Preserving data relationships and structure
  • Enabling complex analytical queries
  • Providing exact, calculated results
  • Maintaining data integrity and types

Potential Enhancements

1. Multi-Table Relationships

To handle more complex Excel files with multiple sheets and cross-sheet relationships, the system could be extended to manage relationships between multiple tables. For example, if the Excel file contains one sheet with holdings and another with a sector mapping, the system could automatically generate the necessary JOINs for richer cross-table analysis (see the example below the diagram).

graph TD
    A[Holdings Table] --> C[JOIN Operations]
    B[Sector Mapping] --> C
    C --> D[Complex Analytics]
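
For instance, with a hypothetical sector_mapping sheet loaded as its own table, a generated query could join it against the holdings:

SELECT m.macro_sector, SUM(h.weighting) AS total_weight
FROM securities_list h
JOIN sector_mapping m
  ON m.industry_classification = h.industry_classification
GROUP BY m.macro_sector;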

2. Advanced Analytics Integration

  • Statistical functions (correlation, regression)
  • Time series analysis for historical data
  • Machine learning model integration

3. Visualization Pipeline

# Future enhancement: Auto-generate charts
def generate_visualization(query_result, question):
    # Analyze result structure
    # Choose appropriate chart type
    # Generate visualization code
    pass

4. Multi-Format Support

  • Google Sheets integration
  • CSV batch processing
  • Database export compatibility

5. Query Optimization

  • Query caching and reuse
  • Index suggestions
  • Performance monitoring

Business Applications

This system opens up numerous use cases:

  • Financial Analysis: Portfolio composition, risk assessment
  • Sales Analytics: Performance tracking, trend analysis
  • Inventory Management: Stock levels, demand forecasting
  • HR Analytics: Workforce composition, performance metrics
  • Marketing Analytics: Campaign performance, customer segmentation

The key insight is that structured data needs structured solutions. By combining the natural language understanding of LLMs with the precise capabilities of SQL, we can create powerful tools that make complex data analysis accessible to everyone.