How to Evaluate Text-to-SQL on Your Own Data, Not Vendor Benchmarks
At a glance
- BIRD contains 12,751 question-SQL pairs across 95 databases and more than 37 professional domains, which makes it one of the strongest public baselines for Text-to-SQL evaluation.
- Spider 2.0 is built from 632 real-world enterprise workflow problems and reports that even o1-preview solves only 17.1 percent of tasks, versus 86.6 percent on Spider 1.0.
- MMTU expands the evaluation lens beyond NL-to-SQL into broader table reasoning, with about 28K questions across 25 real-world table tasks, showing that table intelligence is wider than one benchmark family.
- BigQuery conversational analytics states that direct conversations are less accurate than data agents because direct conversations do not include the same context and processing instructions.
- BigQuery authored context guidance recommends table descriptions, synonyms, relationships, glossaries, and example queries to improve response quality.
- Snowflake Cortex Analyst says generic schema-only approaches struggle because schemas lack business-process definitions and metric handling.
- Snowflake's verified-query workflow uses validated example queries to improve semantic models and broaden the range of questions the system can answer correctly.
- Snowflake Semantic Views emphasize that semantic definitions solve the gap between how business users describe data and how data is stored physically.
Public leaderboards are useful, but they are not procurement criteria. BIRD covers 12,751 question-SQL pairs across 95 real databases, while Spider 2.0 pushes into enterprise-style workflows with 632 tasks and shows even strong models struggling on real-world complexity. Neither benchmark knows what "qualified pipeline" means in your company.
What You Are Actually Evaluating
When a vendor says "our Text-to-SQL is accurate," the first question to ask is: accurate at what?
For a data leader, the answer should cover at least five things:
- Business correctness: did the output match the answer your team would approve?
- Join correctness: were the tables and relationships valid?
- Definition correctness: did the system use the approved metric logic?
- Access correctness: did the answer respect the requesting user's permissions?
- Stability: does the system return the same answer when the question is asked a few different ways?
This is why public benchmarks help with orientation, not final selection. They tell you whether the model class is credible. They do not tell you whether the tool is safe to deploy on your warehouse.
Why Vendor Benchmarks Mislead
Benchmarks are real and useful. They are also easy to misuse.
BIRD and Spider 2.0 measure hard technical capability, and MMTU shows that table understanding goes well beyond simple SQL generation. But the things that break production analytics are often local:
- internal synonyms
- unofficial but common metric names
- deprecated tables that still look valid
- awkward join paths
- role-specific access boundaries
- business rules that live in docs instead of schemas
BigQuery acknowledges this directly by distinguishing between a context-rich data agent and a direct conversation. Snowflake Cortex Analyst makes the same point from another angle: schemas alone do not encode business-process definitions and metric handling.
If your evaluation ignores that local context, the result is not a production test. It is a product demo with extra steps.
Build a Test Set From Real Questions
Start with 30 to 50 real questions your business already asks.
Pull them from:
- Slack threads
- analyst tickets
- recurring executive asks
- dashboard follow-up requests
- finance and RevOps review meetings
Your test set should contain five categories:
1. Simple Metric Lookups
Examples:
- What was net revenue last quarter?
- How many enterprise customers churned last month?
These should pass, but they are not enough to differentiate tools.
2. Multi-Table Business Questions
Examples:
- Which opportunities above $100K have open support escalations and declining usage?
- Which pipeline segments have the highest conversion but lowest net retention?
These expose join quality and relationship reasoning.
3. Ambiguous Language
Examples:
- What is our best segment?
- Which customers are at risk?
These expose whether the system asks clarifying questions, uses approved defaults, or guesses.
4. Restricted-Access Questions
Examples:
- Ask the same finance question as a finance admin and as a sales manager.
These expose access leaks. A system that returns a correct answer to the wrong person still fails.
5. Follow-Up and Iterative Questions
Examples:
- Which region missed plan?
- Break that down by segment.
- Now exclude partner-sourced deals.
These expose whether the tool can maintain context without drifting.
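The five categories above can be captured as a small structured test set before any vendor demo begins. The following is a minimal sketch; the category names, roles, and questions are illustrative assumptions, not part of any vendor's API.

```python
from dataclasses import dataclass

# Hypothetical category names mirroring the five groups above.
CATEGORIES = (
    "simple_metric",
    "multi_table",
    "ambiguous",
    "restricted_access",
    "follow_up",
)

@dataclass
class TestQuestion:
    text: str
    category: str                 # one of CATEGORIES
    roles: tuple = ("analyst",)   # user roles to replay the question as
    expected: object = None       # approved answer, if a baseline exists

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

# A few entries drawn from the examples in this section.
TEST_SET = [
    TestQuestion("What was net revenue last quarter?", "simple_metric"),
    TestQuestion(
        "Which opportunities above $100K have open support escalations "
        "and declining usage?",
        "multi_table",
    ),
    TestQuestion("Which customers are at risk?", "ambiguous"),
    TestQuestion(
        "What was adjusted gross margin last month?",
        "restricted_access",
        roles=("finance_admin", "sales_manager"),
    ),
]

def by_category(questions):
    """Group questions so coverage gaps per category are visible."""
    groups = {c: [] for c in CATEGORIES}
    for q in questions:
        groups[q.category].append(q)
    return groups
```

Grouping by category up front makes it obvious when a test set is all simple lookups and no ambiguity or access cases, which is the most common way evaluations get gamed.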
Score Against Business Outcomes, Not SQL Syntax
A valid SQL query is not the same thing as a good answer.
Use a scoring sheet like this:
| Dimension | What to check |
|---|---|
| Business correctness | Does the answer match the approved result? |
| SQL correctness | Does the query run and use valid syntax? |
| Join correctness | Are the right tables and join paths used? |
| Metric correctness | Are official formulas, filters, and aggregations applied? |
| Access correctness | Does the result respect user-level permissions? |
| Transparency | Can the reviewer inspect SQL, sources, or reasoning? |
| Consistency | Does the tool return the same answer across paraphrases? |
| Recovery | Does it clarify or recover when the question is ambiguous? |
This scoring model is stricter than benchmark accuracy, but that is the point. Production analytics has stricter requirements than research comparison.
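The scoring sheet above can be enforced mechanically so that a reviewer must judge every dimension explicitly. This is a sketch under the assumption that each dimension is scored pass/fail; the dimension keys are taken directly from the table.

```python
# The eight dimensions from the scoring sheet above.
DIMENSIONS = (
    "business_correctness",
    "sql_correctness",
    "join_correctness",
    "metric_correctness",
    "access_correctness",
    "transparency",
    "consistency",
    "recovery",
)

def score_answer(checks: dict) -> dict:
    """Score one answer against every dimension.

    `checks` maps dimension -> bool (True = pass). Missing dimensions
    count as failures, so reviewers cannot silently skip a check.
    """
    row = {dim: bool(checks.get(dim, False)) for dim in DIMENSIONS}
    row["passed"] = all(row[dim] for dim in DIMENSIONS)
    return row

def pass_rate(rows) -> float:
    """Fraction of answers that pass on every dimension."""
    if not rows:
        return 0.0
    return sum(r["passed"] for r in rows) / len(rows)
```

Treating an unreviewed dimension as a failure is a deliberate design choice: it makes the score strictly harder to inflate than a benchmark-style execution-accuracy number.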
Evaluate With and Without Context
One of the clearest ways to understand a system is to test it in two modes:
- Schema-only or minimal-context mode
- Context-rich mode with definitions, relationships, glossary terms, and examples
BigQuery's authored context guide provides a concrete pattern: add descriptions, synonyms, tags, relationships, glossary definitions, and example queries. Snowflake uses verified queries to improve semantic concepts and answer a broader range of questions correctly.
This is an important test because it tells you whether the tool benefits from your current metadata maturity, and whether improving context is likely to improve outcomes after launch.
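The two-mode comparison can be run with a small harness that measures the lift from added context. This is a sketch; `ask(question, mode)` is a hypothetical stand-in for the system under test that returns True when the answer matches the approved baseline, and the mode names are assumptions.

```python
def compare_modes(questions, ask):
    """Run the same questions in schema-only and context-rich modes.

    Returns per-mode accuracy plus the "context lift": how much of the
    question set flips from wrong to right once definitions,
    relationships, and examples are supplied.
    """
    results = {"schema": 0, "context": 0}
    for q in questions:
        for mode in results:
            if ask(q, mode):
                results[mode] += 1
    n = len(questions)
    if n == 0:
        return {"schema_acc": 0.0, "context_acc": 0.0, "context_lift": 0.0}
    return {
        "schema_acc": results["schema"] / n,
        "context_acc": results["context"] / n,
        "context_lift": (results["context"] - results["schema"]) / n,
    }
```

A large context lift is good news twice over: the tool is usable today, and investing in metadata after launch will keep paying off.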
Require a Verified Baseline
For the most important questions, you need a reviewer-approved answer set.
That means:
- the expected result is documented
- the approved SQL exists when possible
- the metric definition is identified
- the owner signs off on correctness
Without a verified baseline, disagreement turns into opinion. With a baseline, you can measure progress and regressions.
This is also the best way to avoid false confidence. Many teams over-credit a system for being fluent when the real test is repeatable correctness.
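The verified baseline described above can be stored as a simple record per question and replayed on every evaluation run. This is a minimal sketch; the field names mirror the checklist and are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BaselineEntry:
    """One reviewer-approved answer, per the checklist above."""
    question: str
    expected_result: object   # the documented expected result
    approved_sql: str         # approved SQL, when it exists (else "")
    metric_definition: str    # the identified metric definition
    owner: str                # who signed off on correctness

def check_regressions(baseline, run_results):
    """Compare a new evaluation run against the verified baseline.

    `run_results` maps question -> observed result. Returns the
    questions whose result no longer matches the approved answer.
    """
    return [
        b.question
        for b in baseline
        if run_results.get(b.question) != b.expected_result
    ]
```

Because the entries are frozen and owned, a disagreement during evaluation becomes a measurable regression against a signed-off answer rather than a debate about opinion.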
Include Security and Governance in the Evaluation
Text-to-SQL evaluation fails when it is treated only as a modeling exercise.
Your evaluation should include:
- role-based question replay
- sensitive-field access checks
- audit-log availability
- citation or lineage inspection
- escalation behavior for low-confidence or out-of-scope prompts
If the system does not support those checks well, the limitation should be visible during evaluation, not after rollout.
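Role-based question replay, the first check in the list above, can be sketched as follows. `ask(question, role)` is a hypothetical stand-in for the system under test that returns an answer or None when it refuses; `allowed` is the set of roles that should receive an answer under your access policy.

```python
def replay_as_roles(question, roles, ask, allowed):
    """Replay one question as several roles and flag access failures.

    A correct answer delivered to a disallowed role is a leak; a
    refusal to an allowed role is an availability failure. Both are
    recorded, because both block rollout.
    """
    leaks, wrong_refusals = [], []
    for role in roles:
        answer = ask(question, role)
        if answer is not None and role not in allowed:
            leaks.append(role)
        if answer is None and role in allowed:
            wrong_refusals.append(role)
    return {"leaks": leaks, "wrong_refusals": wrong_refusals}
```

Note that this check never inspects the SQL at all: it scores the system on who received an answer, which is exactly the failure mode a syntax-focused evaluation misses.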
What Good Looks Like
A good Text-to-SQL evaluation usually produces a decision like this:
- safe to roll out now for these 20 questions and these two teams
- promising, but needs better context for joins and definitions
- not ready, because access handling or ambiguity recovery is weak
That is a much more useful conclusion than a single accuracy number. It tells you where the system is deployable and what must improve next.
How a Context Layer Makes the Evaluation More Realistic
The biggest evaluation mistake is testing the model in a condition your production stack will never use.
Kaelio auto-builds a governed context layer from your data stack. Its built-in data agent (and any MCP-compatible agent) can then deliver trusted, sourced answers to every team.
For evaluation work, that means your question set can be tested against the same metric definitions, relationships, lineage, and business documentation the production system will actually use. Instead of asking whether a model can infer everything from raw schemas, you ask whether the full governed system can answer your real questions safely and consistently.
That is the right bar for a data leader.
Practical Evaluation Workflow
Use this workflow:
- Collect 30 to 50 real questions.
- Group them by difficulty, ambiguity, and access sensitivity.
- Create an approved baseline for the most important questions.
- Test schema-only behavior.
- Test context-rich behavior with definitions, examples, and relationships.
- Replay questions as different user roles.
- Record correctness, consistency, and failure modes.
- Decide where the system is safe to launch first.
If you want the broader tool-selection version of this process, see how to evaluate AI analytics tools. If you want the concept background first, start with what is Text-to-SQL and how accurate are AI data analyst tools.
FAQ
Why are public Text-to-SQL benchmarks not enough for vendor evaluation?
Public benchmarks are useful for comparing general capability, but they do not capture your schema ambiguity, metric definitions, access policies, or business vocabulary. Production evaluation has to use your own data and your own questions.
How many test questions should I use in a Text-to-SQL evaluation?
Most teams should start with 30 to 50 real business questions. That is enough to expose failures in joins, permissions, ambiguity handling, and follow-up reasoning without turning the evaluation into a months-long research project.
What should count as a successful Text-to-SQL answer?
A successful answer is not just valid SQL. It returns the correct business result, uses the right joins and metric definitions, respects access controls, and provides enough transparency to be audited or corrected.
How do context and example queries affect Text-to-SQL accuracy?
Added context materially improves accuracy. BigQuery's authored-context guidance recommends table descriptions, synonyms, relationships, glossary definitions, and example queries, and Snowflake uses verified queries to improve semantic models and broaden the range of questions answered correctly. Testing with and without this context shows how much of the system's performance depends on your metadata maturity.
How does Kaelio help evaluate and improve Text-to-SQL?
Kaelio auto-builds a governed context layer from your data stack. Its built-in data agent, and any MCP-compatible agent, can then evaluate questions against shared metric definitions, lineage, and source context, which makes Text-to-SQL tests more representative of production conditions.
Sources
- https://bird-bench.github.io/
- https://spider2-sql.github.io/
- https://github.com/MMTU-Benchmark/MMTU
- https://docs.cloud.google.com/bigquery/docs/conversational-analytics
- https://docs.cloud.google.com/gemini/data-agents/conversational-analytics-api/data-agent-authored-context-bq
- https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-analyst
- https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-analyst/analyst-optimization
- https://docs.snowflake.com/en/user-guide/views-semantic/overview