Last reviewed April 22, 2026 · 7 min read

How to Evaluate Text-to-SQL on Your Own Data, Not Vendor Benchmarks

At a glance

  • BIRD contains 12,751 question-SQL pairs across 95 databases and more than 37 professional domains, which makes it one of the strongest public baselines for Text-to-SQL evaluation.
  • Spider 2.0 is built from 632 real-world enterprise workflow problems and reports that even o1-preview solves only 17.1 percent of tasks, versus 86.6 percent on Spider 1.0.
  • MMTU expands the evaluation lens beyond NL-to-SQL into broader table reasoning with about 28K questions across 25 real-world table tasks, showing that table intelligence is wider than one benchmark family.
  • BigQuery conversational analytics states that direct conversations are less accurate than data agents because direct conversations do not include the same context and processing instructions.
  • BigQuery authored context guidance recommends table descriptions, synonyms, relationships, glossaries, and example queries to improve response quality.
  • Snowflake Cortex Analyst says generic schema-only approaches struggle because schemas lack business-process definitions and metric handling.
  • Snowflake's verified-query workflow uses validated example queries to improve semantic models and broaden the range of questions the system can answer correctly.
  • Snowflake Semantic Views emphasize that semantic definitions solve the gap between how business users describe data and how data is stored physically.


Public leaderboards are useful, but they are not procurement criteria. BIRD covers 12,751 question-SQL pairs across 95 real databases, while Spider 2.0 pushes into enterprise-style workflows with 632 tasks and shows even strong models struggling on real-world complexity. Neither benchmark knows what "qualified pipeline" means in your company.

What You Are Actually Evaluating

When a vendor says "our Text-to-SQL is accurate," the question to ask is: accurate at what?

For a data leader, the answer should cover at least five things:

  1. Business correctness: did the output match the answer your team would approve?
  2. Join correctness: were the tables and relationships valid?
  3. Definition correctness: did the system use the approved metric logic?
  4. Access correctness: did the answer respect the requesting user's permissions?
  5. Stability: does the system return the same answer when the question is asked a few different ways?
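The five dimensions above can be captured as a per-question record. This is a minimal sketch with illustrative field names, not any vendor's schema; the strict all-or-nothing pass rule is a deliberate choice, because partial credit hides exactly the failures that matter in production.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One evaluated question; all names here are illustrative."""
    business_correct: bool    # matches the reviewer-approved answer
    joins_correct: bool       # valid tables and relationships
    definition_correct: bool  # approved metric logic used
    access_correct: bool      # requesting user's permissions respected
    stable: bool              # same answer across paraphrases

    def passes(self) -> bool:
        # A question passes only if every dimension passes.
        return all((self.business_correct, self.joins_correct,
                    self.definition_correct, self.access_correct,
                    self.stable))
```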

This is why public benchmarks help with orientation, not final selection. They tell you whether the model class is credible. They do not tell you whether the tool is safe to deploy on your warehouse.

Why Vendor Benchmarks Mislead

Benchmarks are real and useful. They are also easy to misuse.

BIRD and Spider 2.0 measure hard technical capability, and MMTU shows that table understanding goes well beyond simple SQL generation. But the things that break production analytics are often local:

  • internal synonyms
  • unofficial but common metric names
  • deprecated tables that still look valid
  • awkward join paths
  • role-specific access boundaries
  • business rules that live in docs instead of schemas
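Several of these local failure modes reduce to missing vocabulary. As a hedged sketch (the terms and definitions below are made up), a tiny synonym map that expands internal shorthand into approved definitions before the question reaches the model illustrates why context-rich systems outperform schema-only ones:

```python
# Illustrative internal glossary; every entry here is a placeholder.
SYNONYMS = {
    "qualified pipeline": "opportunities with stage >= 2 and amount > 0",
    "logo churn": "count of accounts whose status changed to 'churned'",
    "arr": "annual recurring revenue, per the approved metric definition",
}

def expand_synonyms(question: str) -> str:
    """Annotate known internal terms with their approved definitions."""
    out = question.lower()
    for term, definition in SYNONYMS.items():
        out = out.replace(term, f"{term} (defined as: {definition})")
    return out
```

A schema-only system never sees this mapping, which is one concrete reason "qualified pipeline" means nothing to it.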

BigQuery acknowledges this directly by distinguishing between a context-rich data agent and a direct conversation. Snowflake Cortex Analyst makes the same point from another angle: schemas alone do not encode business-process definitions and metric handling.

If your evaluation ignores that local context, the result is not a production test. It is a product demo with extra steps.

Build a Test Set From Real Questions

Start with 30 to 50 real questions your business already asks.

Pull them from:

  • Slack threads
  • analyst tickets
  • recurring executive asks
  • dashboard follow-up requests
  • finance and RevOps review meetings

Your test set should contain five categories:

1. Simple Metric Lookups

Examples:

  • What was net revenue last quarter?
  • How many enterprise customers churned last month?

These should pass, but they are not enough to differentiate tools.

2. Multi-Table Business Questions

Examples:

  • Which opportunities above $100K have open support escalations and declining usage?
  • Which pipeline segments have the highest conversion but lowest net retention?

These expose join quality and relationship reasoning.

3. Ambiguous Language

Examples:

  • What is our best segment?
  • Which customers are at risk?

These expose whether the system asks clarifying questions, uses approved defaults, or guesses.

4. Restricted-Access Questions

Examples:

  • Ask the same finance question as a finance admin and as a sales manager.

These expose access leaks. A system that returns a correct answer to the wrong person still fails.

5. Follow-Up and Iterative Questions

Examples:

  • Which region missed plan?
  • Break that down by segment.
  • Now exclude partner-sourced deals.

These expose whether the tool can maintain context without drifting.
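The five categories above can be organized as a simple structured test set. This is a sketch under assumed field names; the questions are the examples from this article, and the role names are placeholders:

```python
# Minimal test-set layout covering the five categories described above.
TEST_SET = [
    {"category": "simple_metric",
     "question": "What was net revenue last quarter?"},
    {"category": "multi_table",
     "question": "Which opportunities above $100K have open support "
                 "escalations and declining usage?"},
    {"category": "ambiguous",
     "question": "Which customers are at risk?",
     "expect_clarification": True},  # guessing silently should fail
    {"category": "restricted_access",
     "question": "Show payroll cost by department",
     "roles": ["finance_admin", "sales_manager"]},  # replay as both
    {"category": "follow_up",
     "questions": ["Which region missed plan?",
                   "Break that down by segment.",
                   "Now exclude partner-sourced deals."]},
]

def by_category(category: str) -> list:
    """Select all test cases in one category."""
    return [t for t in TEST_SET if t["category"] == category]
```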

Score Against Business Outcomes, Not SQL Syntax

A valid SQL query is not the same thing as a good answer.

Use a scoring sheet like this:

Dimension            | What to check
Business correctness | Does the answer match the approved result?
SQL correctness      | Does the query run and use valid syntax?
Join correctness     | Are the right tables and join paths used?
Metric correctness   | Are official formulas, filters, and aggregations applied?
Access correctness   | Does the result respect user-level permissions?
Transparency         | Can the reviewer inspect SQL, sources, or reasoning?
Consistency          | Does the tool return the same answer across paraphrases?
Recovery             | Does it clarify or recover when the question is ambiguous?
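The scoring sheet can be turned into a per-question score. A minimal sketch, assuming equal weights (a choice you may want to change, for example to make access correctness a hard gate):

```python
# Dimension names mirror the scoring sheet above.
DIMENSIONS = [
    "business", "sql", "joins", "metrics",
    "access", "transparency", "consistency", "recovery",
]

def score(results: dict) -> float:
    """Fraction of dimensions passed for one question, 0.0 to 1.0.

    `results` maps dimension name -> bool; missing keys count as fail.
    """
    return sum(bool(results.get(d)) for d in DIMENSIONS) / len(DIMENSIONS)
```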

This scoring model is stricter than benchmark accuracy, but that is the point. Production analytics has stricter requirements than research comparison.

Evaluate With and Without Context

One of the clearest ways to understand a system is to test it in two modes:

  1. Schema-only or minimal-context mode
  2. Context-rich mode with definitions, relationships, glossary terms, and examples

BigQuery's authored context guide provides a concrete pattern: add descriptions, synonyms, tags, relationships, glossary definitions, and example queries. Snowflake uses verified queries to improve semantic concepts and answer a broader range of questions correctly.

This is an important test because it tells you whether the tool benefits from your current metadata maturity, and whether improving context is likely to improve outcomes after launch.
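The two-mode comparison can be run with a small harness. This is a sketch, not a real integration: `ask` stands in for whatever query entry point your tool exposes, and `grade` stands in for your baseline check.

```python
def compare_modes(questions, ask, grade):
    """Run each question in schema-only and context-rich mode.

    `ask(question, context=...)` and `grade(answer)` are placeholders
    for your tool's client and your correctness check.
    Returns the pass rate per mode.
    """
    passed = {"schema_only": 0, "context_rich": 0}
    for q in questions:
        if grade(ask(q, context=None)):
            passed["schema_only"] += 1
        if grade(ask(q, context="full")):
            passed["context_rich"] += 1
    n = len(questions)
    return {mode: count / n for mode, count in passed.items()}
```

The gap between the two numbers is the signal: a large gap means your metadata investment will pay off after launch; a small gap on a weak baseline means the tool may not improve no matter how much context you author.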

Require a Verified Baseline

For the most important questions, you need a reviewer-approved answer set.

That means:

  • the expected result is documented
  • the approved SQL exists when possible
  • the metric definition is identified
  • the owner signs off on correctness

Without a verified baseline, disagreement turns into opinion. With a baseline, you can measure progress and regressions.

This is also the best way to avoid false confidence. Many teams over-credit a system for being fluent when the real test is repeatable correctness.
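Once a baseline exists, the comparison itself should be mechanical. A minimal sketch that compares result rows against approved rows, ignoring row order and tolerating float noise (both assumptions you may want to tighten):

```python
def matches_baseline(result_rows, baseline_rows):
    """Order-insensitive row comparison against the approved answer.

    Rows are tuples of values; floats are rounded to 6 decimal places
    so harmless representation noise does not fail the check.
    """
    if len(result_rows) != len(baseline_rows):
        return False
    def normalize(rows):
        return sorted(
            tuple(round(v, 6) if isinstance(v, float) else v for v in row)
            for row in rows
        )
    return normalize(result_rows) == normalize(baseline_rows)
```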

Include Security and Governance in the Evaluation

Text-to-SQL evaluation fails when it is treated only as a modeling exercise.

Your evaluation should include:

  • role-based question replay
  • sensitive-field access checks
  • audit-log availability
  • citation or lineage inspection
  • escalation behavior for low-confidence or out-of-scope prompts

If the system does not support those checks well, the limitation should be visible during evaluation, not after rollout.
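Role-based replay can be checked with a simple leak test. This is an illustrative sketch: the restricted field names and the role-to-permission mapping are placeholders for whatever your governance layer defines.

```python
# Placeholder list of sensitive columns; substitute your own.
RESTRICTED_FIELDS = {"salary", "ssn", "comp_plan"}

def leaks_restricted(answer_columns, role, allowed_by_role):
    """True if the answer exposes restricted fields this role may not see.

    `allowed_by_role` maps role name -> set of restricted fields that
    role is explicitly permitted to view.
    """
    exposed = set(answer_columns) & RESTRICTED_FIELDS
    return bool(exposed - allowed_by_role.get(role, set()))
```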

What Good Looks Like

A good Text-to-SQL evaluation usually produces a decision like this:

  • safe to roll out now for these 20 questions and these two teams
  • promising, but needs better context for joins and definitions
  • not ready, because access handling or ambiguity recovery is weak

That is a much more useful conclusion than a single accuracy number. It tells you where the system is deployable and what must improve next.

How a Context Layer Makes the Evaluation More Realistic

The biggest evaluation mistake is testing the model in a condition your production stack will never use.

Kaelio auto-builds a governed context layer from your data stack. Its built-in data agent (and any MCP-compatible agent) can then deliver trusted, sourced answers to every team.

For evaluation work, that means your question set can be tested against the same metric definitions, relationships, lineage, and business documentation the production system will actually use. Instead of asking whether a model can infer everything from raw schemas, you ask whether the full governed system can answer your real questions safely and consistently.

That is the right bar for a data leader.

Practical Evaluation Workflow

Use this workflow:

  1. Collect 30 to 50 real questions.
  2. Group them by difficulty, ambiguity, and access sensitivity.
  3. Create an approved baseline for the most important questions.
  4. Test schema-only behavior.
  5. Test context-rich behavior with definitions, examples, and relationships.
  6. Replay questions as different user roles.
  7. Record correctness, consistency, and failure modes.
  8. Decide where the system is safe to launch first.

If you want the broader tool-selection version of this process, see how to evaluate AI analytics tools. If you want the concept background first, start with what is Text-to-SQL and how accurate are AI data analyst tools.

FAQ

Why are public Text-to-SQL benchmarks not enough for vendor evaluation?

Public benchmarks are useful for comparing general capability, but they do not capture your schema ambiguity, metric definitions, access policies, or business vocabulary. Production evaluation has to use your own data and your own questions.

How many test questions should I use in a Text-to-SQL evaluation?

Most teams should start with 30 to 50 real business questions. That is enough to expose failures in joins, permissions, ambiguity handling, and follow-up reasoning without turning the evaluation into a months-long research project.

What should count as a successful Text-to-SQL answer?

A successful answer is not just valid SQL. It returns the correct business result, uses the right joins and metric definitions, respects access controls, and provides enough transparency to be audited or corrected.

How do context and example queries affect Text-to-SQL accuracy?

Context and example queries usually matter a lot. BigQuery recommends structured context, glossaries, relationships, and example queries for higher-quality answers, and Snowflake uses verified queries to improve semantic models and answer a broader range of questions correctly.

How does Kaelio help evaluate and improve Text-to-SQL?

Kaelio auto-builds a governed context layer from your data stack. Its built-in data agent, and any MCP-compatible agent, can then evaluate questions against shared metric definitions, lineage, and source context, which makes Text-to-SQL tests more representative of production conditions.

Get Started

Give your data and analytics agents the context layer they deserve.

Auto-built. Governed by your team. Ready for any agent.
