Last reviewed April 22, 2026 · 7 min read

How to Evaluate Text-to-SQL on Your Own Data, Not Vendor Benchmarks

At a glance

  • BIRD contains 12,751 question-SQL pairs across 95 databases and more than 37 professional domains, which makes it one of the strongest public baselines for Text-to-SQL evaluation.
  • Spider 2.0 is built from 632 real-world enterprise workflow problems and reports that even o1-preview solves only 17.1 percent of tasks, versus 86.6 percent on Spider 1.0.
  • MMTU expands the evaluation lens beyond NL-to-SQL into broader table reasoning with about 28K questions across 25 real-world table tasks, showing that table intelligence is wider than one benchmark family.
  • BigQuery conversational analytics states that direct conversations are less accurate than data agents because direct conversations do not include the same context and processing instructions.
  • BigQuery authored context guidance recommends table descriptions, synonyms, relationships, glossaries, and example queries to improve response quality.
  • Snowflake Cortex Analyst says generic schema-only approaches struggle because schemas lack business-process definitions and metric handling.
  • Snowflake's verified-query workflow uses validated example queries to improve semantic models and broaden the range of questions the system can answer correctly.
  • Snowflake Semantic Views emphasize that semantic definitions solve the gap between how business users describe data and how data is stored physically.


Public leaderboards are useful, but they are not procurement criteria. BIRD covers 12,751 question-SQL pairs across 95 real databases, while Spider 2.0 pushes into enterprise-style workflows with 632 tasks and shows even strong models struggling on real-world complexity. Neither benchmark knows what "qualified pipeline" means in your company.

What You Are Actually Evaluating

When a vendor says "our Text-to-SQL is accurate," the question to ask is: accurate at what?

For a data leader, the answer should cover at least five things:

  1. Business correctness: did the output match the answer your team would approve?
  2. Join correctness: were the tables and relationships valid?
  3. Definition correctness: did the system use the approved metric logic?
  4. Access correctness: did the answer respect the requesting user's permissions?
  5. Stability: does the system return the same answer when the question is asked a few different ways?
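The five dimensions above can be captured as a per-question record. This is a minimal sketch with illustrative field names, not any vendor's schema; the strict all-or-nothing pass rule is a deliberate choice, because partial credit hides exactly the failures that matter in production.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One evaluated question; all names here are illustrative."""
    business_correct: bool    # matches the reviewer-approved answer
    joins_correct: bool       # valid tables and relationships
    definition_correct: bool  # approved metric logic used
    access_correct: bool      # requesting user's permissions respected
    stable: bool              # same answer across paraphrases

    def passes(self) -> bool:
        # A question passes only if every dimension passes.
        return all((self.business_correct, self.joins_correct,
                    self.definition_correct, self.access_correct,
                    self.stable))
```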

This is why public benchmarks help with orientation, not final selection. They tell you whether the model class is credible. They do not tell you whether the tool is safe to deploy on your warehouse.

Why Vendor Benchmarks Mislead

Benchmarks are real and useful. They are also easy to misuse.

BIRD and Spider 2.0 measure hard technical capability, and MMTU shows that table understanding goes well beyond simple SQL generation. But the things that break production analytics are often local:

  • internal synonyms
  • unofficial but common metric names
  • deprecated tables that still look valid
  • awkward join paths
  • role-specific access boundaries
  • business rules that live in docs instead of schemas
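Several of these local failure modes reduce to missing vocabulary. As a hedged sketch (the terms and definitions below are made up), a tiny synonym map that expands internal shorthand into approved definitions before the question reaches the model illustrates why context-rich systems outperform schema-only ones:

```python
# Illustrative internal glossary; every entry here is a placeholder.
SYNONYMS = {
    "qualified pipeline": "opportunities with stage >= 2 and amount > 0",
    "logo churn": "count of accounts whose status changed to 'churned'",
    "arr": "annual recurring revenue, per the approved metric definition",
}

def expand_synonyms(question: str) -> str:
    """Annotate known internal terms with their approved definitions."""
    out = question.lower()
    for term, definition in SYNONYMS.items():
        out = out.replace(term, f"{term} (defined as: {definition})")
    return out
```

A schema-only system never sees this mapping, which is one concrete reason "qualified pipeline" means nothing to it.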

BigQuery acknowledges this directly by distinguishing between a context-rich data agent and a direct conversation. Snowflake Cortex Analyst makes the same point from another angle: schemas alone do not encode business-process definitions and metric handling.

If your evaluation ignores that local context, the result is not a production test. It is a product demo with extra steps.

Build a Test Set From Real Questions

Start with 30 to 50 real questions your business already asks.

Pull them from:

  • Slack threads
  • analyst tickets
  • recurring executive asks
  • dashboard follow-up requests
  • finance and RevOps review meetings

Your test set should contain five categories:

1. Simple Metric Lookups

Examples:

  • What was net revenue last quarter?
  • How many enterprise customers churned last month?

These should pass, but they are not enough to differentiate tools.

2. Multi-Table Business Questions

Examples:

  • Which opportunities above $100K have open support escalations and declining usage?
  • Which pipeline segments have the highest conversion but lowest net retention?

These expose join quality and relationship reasoning.

3. Ambiguous Language

Examples:

  • What is our best segment?
  • Which customers are at risk?

These expose whether the system asks clarifying questions, uses approved defaults, or guesses.

4. Restricted-Access Questions

Examples:

  • Ask the same finance question as a finance admin and as a sales manager.

These expose access leaks. A system that returns a correct answer to the wrong person still fails.

5. Follow-Up and Iterative Questions

Examples:

  • Which region missed plan?
  • Break that down by segment.
  • Now exclude partner-sourced deals.

These expose whether the tool can maintain context without drifting.
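The five categories above can be organized as a simple structured test set. This is a sketch under assumed field names; the questions are the examples from this article, and the role names are placeholders:

```python
# Minimal test-set layout covering the five categories described above.
TEST_SET = [
    {"category": "simple_metric",
     "question": "What was net revenue last quarter?"},
    {"category": "multi_table",
     "question": "Which opportunities above $100K have open support "
                 "escalations and declining usage?"},
    {"category": "ambiguous",
     "question": "Which customers are at risk?",
     "expect_clarification": True},  # guessing silently should fail
    {"category": "restricted_access",
     "question": "Show payroll cost by department",
     "roles": ["finance_admin", "sales_manager"]},  # replay as both
    {"category": "follow_up",
     "questions": ["Which region missed plan?",
                   "Break that down by segment.",
                   "Now exclude partner-sourced deals."]},
]

def by_category(category: str) -> list:
    """Select all test cases in one category."""
    return [t for t in TEST_SET if t["category"] == category]
```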

Score Against Business Outcomes, Not SQL Syntax

A valid SQL query is not the same thing as a good answer.

Use a scoring sheet like this:

Dimension            | What to check
Business correctness | Does the answer match the approved result?
SQL correctness      | Does the query run and use valid syntax?
Join correctness     | Are the right tables and join paths used?
Metric correctness   | Are official formulas, filters, and aggregations applied?
Access correctness   | Does the result respect user-level permissions?
Transparency         | Can the reviewer inspect SQL, sources, or reasoning?
Consistency          | Does the tool return the same answer across paraphrases?
Recovery             | Does it clarify or recover when the question is ambiguous?
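The scoring sheet can be turned into a per-question score. A minimal sketch, assuming equal weights (a choice you may want to change, for example to make access correctness a hard gate):

```python
# Dimension names mirror the scoring sheet above.
DIMENSIONS = [
    "business", "sql", "joins", "metrics",
    "access", "transparency", "consistency", "recovery",
]

def score(results: dict) -> float:
    """Fraction of dimensions passed for one question, 0.0 to 1.0.

    `results` maps dimension name -> bool; missing keys count as fail.
    """
    return sum(bool(results.get(d)) for d in DIMENSIONS) / len(DIMENSIONS)
```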

This scoring model is stricter than benchmark accuracy, but that is the point. Production analytics has stricter requirements than research comparison.

Evaluate With and Without Context

One of the clearest ways to understand a system is to test it in two modes:

  1. Schema-only or minimal-context mode
  2. Context-rich mode with definitions, relationships, glossary terms, and examples

BigQuery's authored context guide provides a concrete pattern: add descriptions, synonyms, tags, relationships, glossary definitions, and example queries. Snowflake uses verified queries to improve semantic concepts and answer a broader range of questions correctly.

This is an important test because it tells you whether the tool benefits from your current metadata maturity, and whether improving context is likely to improve outcomes after launch.
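The two-mode comparison can be run with a small harness. This is a sketch, not a real integration: `ask` stands in for whatever query entry point your tool exposes, and `grade` stands in for your baseline check.

```python
def compare_modes(questions, ask, grade):
    """Run each question in schema-only and context-rich mode.

    `ask(question, context=...)` and `grade(answer)` are placeholders
    for your tool's client and your correctness check.
    Returns the pass rate per mode.
    """
    passed = {"schema_only": 0, "context_rich": 0}
    for q in questions:
        if grade(ask(q, context=None)):
            passed["schema_only"] += 1
        if grade(ask(q, context="full")):
            passed["context_rich"] += 1
    n = len(questions)
    return {mode: count / n for mode, count in passed.items()}
```

The gap between the two numbers is the signal: a large gap means your metadata investment will pay off after launch; a small gap on a weak baseline means the tool may not improve no matter how much context you author.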

Require a Verified Baseline

For the most important questions, you need a reviewer-approved answer set.

That means:

  • the expected result is documented
  • the approved SQL exists when possible
  • the metric definition is identified
  • the owner signs off on correctness

Without a verified baseline, disagreement turns into opinion. With a baseline, you can measure progress and regressions.

This is also the best way to avoid false confidence. Many teams over-credit a system for being fluent when the real test is repeatable correctness.
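Once a baseline exists, the comparison itself should be mechanical. A minimal sketch that compares result rows against approved rows, ignoring row order and tolerating float noise (both assumptions you may want to tighten):

```python
def matches_baseline(result_rows, baseline_rows):
    """Order-insensitive row comparison against the approved answer.

    Rows are tuples of values; floats are rounded to 6 decimal places
    so harmless representation noise does not fail the check.
    """
    if len(result_rows) != len(baseline_rows):
        return False
    def normalize(rows):
        return sorted(
            tuple(round(v, 6) if isinstance(v, float) else v for v in row)
            for row in rows
        )
    return normalize(result_rows) == normalize(baseline_rows)
```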

Include Security and Governance in the Evaluation

Text-to-SQL evaluation fails when it is treated only as a modeling exercise.

Your evaluation should include:

  • role-based question replay
  • sensitive-field access checks
  • audit-log availability
  • citation or lineage inspection
  • escalation behavior for low-confidence or out-of-scope prompts

If the system does not support those checks well, the limitation should be visible during evaluation, not after rollout.
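Role-based replay can be checked with a simple leak test. This is an illustrative sketch: the restricted field names and the role-to-permission mapping are placeholders for whatever your governance layer defines.

```python
# Placeholder list of sensitive columns; substitute your own.
RESTRICTED_FIELDS = {"salary", "ssn", "comp_plan"}

def leaks_restricted(answer_columns, role, allowed_by_role):
    """True if the answer exposes restricted fields this role may not see.

    `allowed_by_role` maps role name -> set of restricted fields that
    role is explicitly permitted to view.
    """
    exposed = set(answer_columns) & RESTRICTED_FIELDS
    return bool(exposed - allowed_by_role.get(role, set()))
```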

What Good Looks Like

A good Text-to-SQL evaluation usually produces a decision like this:

  • safe to roll out now for these 20 questions and these two teams
  • promising, but needs better context for joins and definitions
  • not ready, because access handling or ambiguity recovery is weak

That is a much more useful conclusion than a single accuracy number. It tells you where the system is deployable and what must improve next.

How a Context Layer Makes the Evaluation More Realistic

The biggest evaluation mistake is testing the model in a condition your production stack will never use.

Kaelio auto-builds a governed context layer from your data stack. Its built-in data agent (and any MCP-compatible agent) can then deliver trusted, sourced answers to every team.

For evaluation work, that means your question set can be tested against the same metric definitions, relationships, lineage, and business documentation the production system will actually use. Instead of asking whether a model can infer everything from raw schemas, you ask whether the full governed system can answer your real questions safely and consistently.

That is the right bar for a data leader.

Practical Evaluation Workflow

Use this workflow:

  1. Collect 30 to 50 real questions.
  2. Group them by difficulty, ambiguity, and access sensitivity.
  3. Create an approved baseline for the most important questions.
  4. Test schema-only behavior.
  5. Test context-rich behavior with definitions, examples, and relationships.
  6. Replay questions as different user roles.
  7. Record correctness, consistency, and failure modes.
  8. Decide where the system is safe to launch first.

If you want the broader tool-selection version of this process, see how to evaluate AI analytics tools. If you want the concept background first, start with what is Text-to-SQL and how accurate are AI data analyst tools.

FAQ

Why are public Text-to-SQL benchmarks not enough for vendor evaluation?

Public benchmarks are useful for comparing general capability, but they do not capture your schema ambiguity, metric definitions, access policies, or business vocabulary. Production evaluation has to use your own data and your own questions.

How many test questions should I use in a Text-to-SQL evaluation?

Most teams should start with 30 to 50 real business questions. That is enough to expose failures in joins, permissions, ambiguity handling, and follow-up reasoning without turning the evaluation into a months-long research project.

What should count as a successful Text-to-SQL answer?

A successful answer is not just valid SQL. It returns the correct business result, uses the right joins and metric definitions, respects access controls, and provides enough transparency to be audited or corrected.

How do context and example queries affect Text-to-SQL accuracy?

Context and example queries usually matter a lot. BigQuery recommends structured context, glossaries, relationships, and example queries for higher-quality answers, and Snowflake uses verified queries to improve semantic models and answer a broader range of questions correctly.

How does Kaelio help evaluate and improve Text-to-SQL?

Kaelio auto-builds a governed context layer from your data stack. Its built-in data agent, and any MCP-compatible agent, can then evaluate questions against shared metric definitions, lineage, and source context, which makes Text-to-SQL tests more representative of production conditions.

Get Started

Give your data and analytics agents the context layer they deserve.

Auto-built. Governed by your team. Ready for any agent.
