What Is Text-to-SQL? Accuracy, Limitations, and How Context Layers Fix the Gaps
At a glance
- Text-to-SQL is the task of automatically translating natural language questions into SQL queries that run against a real database, as evaluated on benchmarks like Spider and BIRD.
- It powers most modern conversational analytics and AI data analyst experiences.
- The leading public benchmarks are Spider, Spider 2.0, and BIRD; BIRD alone comprises 12,751 question-SQL pairs across 95 real databases (Li et al., 2023).
- Modern systems combine LLMs with schema linking, in-context examples, and formal semantics such as the dbt Semantic Layer.
- Production accuracy gaps remain visible on enterprise-oriented benchmarks like BIRD and Spider 2.0, which is why schema quality and business context matter as much as raw SQL generation.
- A governed context layer closes those gaps by giving the model canonical definitions, valid joins, lineage, and access policies before it writes SQL, often exposed through standards like MCP and semantic systems like the dbt Semantic Layer.
On the BIRD benchmark, top text-to-SQL systems score in the mid-70 percent range across 95 real databases, while leaders on the harder Spider 2.0 benchmark remain below 60 percent. Kaelio auto-builds a governed context layer from your data stack. Its built-in data agent (and any MCP-compatible agent) can then deliver trusted, sourced answers to every team.
A Working Definition
Text-to-SQL takes two inputs and produces one output:
- Inputs: a natural language question (for example, "What was net revenue last quarter, excluding test accounts?") and a representation of the database (schemas, sample rows, descriptions, sometimes example queries or a semantic model).
- Output: a SQL query that, when executed against the database, returns the correct answer.
That definition is small but the implications are not. "Correct" is doing a lot of work. A query that runs without errors and returns numbers is not the same as a query that returns the answer the business actually meant. The whole field is essentially a long argument about how to close that gap.
A Short History
Text-to-SQL is older than the current AI cycle.
Pre-2017: rule-based and slot-filling systems. Templated parsers handled narrow domains well and broke as soon as queries diverged from expected patterns.
2017 to 2020: neural sequence-to-sequence models. Models like Seq2SQL and SQLNet learned to generate SQL tokens directly. Datasets like WikiSQL and Spider emerged as the first serious benchmarks. Spider in particular forced models to handle complex joins, subqueries, and unseen schemas.
2021 to 2023: large language models. Pretrained LLMs proved much better at the task without specialized architectures. Approaches like DIN-SQL and DAIL-SQL showed that decomposition, schema linking, and in-context examples close large gaps. The BIRD benchmark raised the difficulty by introducing real-world databases and dirty data.
2024 to 2026: agent loops and context layers. The frontier moved from "generate the right SQL in one shot" to "use tools, query metadata, look at examples, refine, and explain." Benchmarks like Spider 2.0 push toward enterprise-scale schemas and multi-step workflows. The shared insight is that grounding in governed business context is now the dominant variable.
How Modern Text-to-SQL Systems Work
Most production systems share roughly the same pipeline.
1. Schema Linking
Given the question, the system identifies which tables and columns are likely relevant. On a database with thousands of tables, sending all of them to the model is wasteful and noisy. Schema linking narrows the candidate set using vector search, name matching, foreign-key graph traversal, and (when available) semantic descriptions.
Schema linking quality is one of the largest single drivers of end-to-end accuracy. The BIRD paper and follow-up work show that models given the right tables and columns up front dramatically outperform models forced to discover them from scratch.
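As a rough sketch of the narrowing step, the toy function below ranks tables by lexical overlap between the question and each table's name, columns, and description. Real systems combine vector search, name matching, and foreign-key graph traversal; the catalog here is invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def link_schema(question, catalog, top_k=2):
    """Rank tables by token overlap between the question and each
    table's name, description, and column names (a stand-in for
    the embedding similarity a production system would use)."""
    q_tokens = tokenize(question)
    scored = []
    for table, meta in catalog.items():
        doc = " ".join([table, meta["description"], *meta["columns"]])
        overlap = len(q_tokens & tokenize(doc))
        scored.append((overlap, table))
    scored.sort(reverse=True)
    return [table for score, table in scored[:top_k] if score > 0]

# Invented three-table catalog for illustration.
catalog = {
    "orders": {
        "columns": ["order_id", "customer_id", "net_revenue", "created_at"],
        "description": "one row per customer order, net of refunds",
    },
    "customers": {
        "columns": ["customer_id", "is_internal", "signup_date"],
        "description": "customer accounts, including internal test accounts",
    },
    "web_events": {
        "columns": ["event_id", "page", "ts"],
        "description": "raw clickstream events",
    },
}

candidates = link_schema(
    "What was net revenue last quarter, excluding test accounts?", catalog)
print(candidates)  # ['orders', 'customers']
```

Even this crude scorer keeps the irrelevant clickstream table out of the context, which is the whole point: the model only reasons over tables that plausibly answer the question.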
2. Context Assembly
Once the candidate schema is identified, the system assembles the context the model will see: column descriptions, primary and foreign keys, example values, joinability hints, and (in mature systems) governed metric definitions and business rules.
This is where many systems quietly diverge. A pure schema crawl gives the model raw structure. A governed context layer gives the model the organization's interpretation of that structure, which is what most business questions actually need.
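To make the divergence concrete, here is a minimal sketch of context assembly. The field names and the governed metric definition are invented for illustration; a mature system would pull these from a semantic layer rather than hand-written dicts.

```python
def assemble_context(tables, governed_metrics):
    """Render the schema plus governed definitions into the text
    block the model will see before generating SQL."""
    lines = ["## Schema"]
    for t in tables:
        lines.append(f"- {t['name']}: {', '.join(t['columns'])}")
        for fk in t.get("foreign_keys", []):
            lines.append(f"  valid join: {fk}")
    lines.append("## Governed metric definitions")
    for name, rule in governed_metrics.items():
        lines.append(f"- {name}: {rule}")
    return "\n".join(lines)

context = assemble_context(
    tables=[
        {"name": "orders",
         "columns": ["order_id", "customer_id", "net_revenue", "created_at"],
         "foreign_keys": ["orders.customer_id = customers.customer_id"]},
        {"name": "customers",
         "columns": ["customer_id", "is_internal"]},
    ],
    governed_metrics={
        "net_revenue": "SUM(orders.net_revenue), excluding rows where "
                       "customers.is_internal is true",
    },
)
print(context)
```

The "Governed metric definitions" section is the part a pure schema crawl cannot produce: it encodes the organization's interpretation, not just the structure.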
3. SQL Generation
The model generates a candidate SQL query, often using techniques like decomposition (break the question into smaller subqueries), self-consistency (generate several candidates and vote), or constrained decoding (force the output to be valid SQL). DAIL-SQL and DIN-SQL are influential examples of these patterns.
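The self-consistency idea can be sketched in a few lines: sample several candidate queries, execute each, and return the answer most candidates agree on. The hard-coded candidate list below stands in for LLM samples drawn at temperature above zero, and the schema is invented for illustration.

```python
import sqlite3
from collections import Counter

def vote_on_candidates(conn, candidates):
    """Execute each candidate query and return the result that
    the most candidates agree on (majority vote)."""
    results = []
    for sql in candidates:
        try:
            rows = tuple(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # a candidate that fails to run loses its vote
        results.append(rows)
    if not results:
        return None
    return Counter(results).most_common(1)[0][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (net_revenue REAL, is_test INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(100.0, 0), (50.0, 0), (999.0, 1)])

candidates = [
    "SELECT SUM(net_revenue) FROM orders WHERE is_test = 0",
    "SELECT SUM(net_revenue) FROM orders WHERE is_test = 0",
    "SELECT SUM(net_revenue) FROM orders",   # forgot the test-account filter
    "SELECT SUM(net_revenu) FROM orders",    # typo: fails to execute
]
print(vote_on_candidates(conn, candidates))  # ((150.0,),)
```

Two of the four samples apply the correct filter, so the vote lands on the right answer even though one sample is wrong and one does not run at all.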
4. Validation and Repair
Strong systems do not stop at generation. They run the SQL (or a dry-run version), inspect errors, and repair. If a column does not exist, the system retries. If the result is empty when it should not be, the system reconsiders the join logic.
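A validate-and-repair loop might look like the sketch below. The repair step here is a simple error-message lookup standing in for an LLM re-generation call; the schema and the repair table are invented for illustration.

```python
import sqlite3

def run_with_repair(conn, sql, repairs, max_attempts=3):
    """Run the candidate SQL; on failure, look up a repaired
    query keyed by the error message and retry."""
    for _ in range(max_attempts):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.OperationalError as exc:
            fixed = repairs.get(str(exc))
            if fixed is None:
                raise          # no known repair: surface the error
            sql = fixed        # retry with the repaired query
    raise RuntimeError("gave up after repeated repair attempts")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (net_revenue REAL)")
conn.execute("INSERT INTO orders VALUES (150.0)")

# In a real system the model would regenerate the query from the
# error; here we hard-code one known fix to show the loop's shape.
repairs = {
    "no such column: revenue": "SELECT SUM(net_revenue) FROM orders",
}
print(run_with_repair(conn, "SELECT SUM(revenue) FROM orders", repairs))
```

The first attempt fails with "no such column: revenue", the loop swaps in the corrected column name, and the second attempt succeeds.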
5. Answer Synthesis
Finally, the system turns the SQL result into a human-readable answer, often with a chart or summary. In governed systems, this step also attaches the citation: the metric definition used, the source tables, the filters applied. This citation is what makes the answer auditable.
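A minimal sketch of that citation-attaching step, with invented field names rather than any particular system's schema:

```python
def synthesize_answer(result, metric, tables, filters):
    """Pair the numeric result with the citation that makes it
    auditable: which definition, tables, and filters were used."""
    return {
        "answer": f"Net revenue last quarter was ${result:,.2f}.",
        "citation": {
            "metric_definition": metric,
            "source_tables": tables,
            "filters": filters,
        },
    }

answer = synthesize_answer(
    150.0,
    metric="net_revenue = gross_revenue - refunds",
    tables=["analytics.orders"],
    filters=["is_test = 0", "created_at in last quarter"],
)
print(answer["answer"])  # Net revenue last quarter was $150.00.
```

Everything in the `citation` block is what an auditor would need to reproduce the number by hand.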
How Accurate Is Text-to-SQL?
The honest answer: better than people expect on benchmarks, worse than people expect in production.
On Public Benchmarks
- Spider (2018) was considered hard at release. Top systems now score above 90 percent execution accuracy.
- BIRD introduced 95 real databases and 12,751 question-SQL pairs. Top systems score in the mid-70 percent range on execution accuracy (Li et al., 2023).
- Spider 2.0 was designed specifically to reflect enterprise-scale workflows: large schemas, multi-step reasoning, dialect variation, and complex transformations. Top systems currently score below 60 percent.
The trajectory matters more than the absolute numbers. Each generation of benchmark has been "solved" within a few years of release, then replaced by a harder one. The frontier has shifted from "generate any valid SQL" to "do the work a real analyst would do, in the conditions a real analyst faces."
On Production Schemas
Real enterprise schemas are typically:
- Larger. Thousands of tables, often with hundreds of columns each.
- Less documented. Many columns have cryptic names and no descriptions.
- More ambiguous. Multiple "revenue" columns, multiple "customer" tables, multiple ways to join the same entities.
- Drifting. Schemas change continuously, and stale documentation is the norm rather than the exception. We covered this in how to prevent schema drift from breaking your AI data agent.
Production accuracy usually starts below benchmark accuracy because enterprise schemas are larger, less documented, and more ambiguous than benchmark datasets. That gap is why serious evaluations rely on a company's own question set rather than leaderboard numbers alone. For an empirical look at the trust gap, see can AI analytics tools be trusted with business metrics and how accurate are AI data analyst tools.
Where Text-to-SQL Breaks (and Why)
The most common failure modes are not random.
Ambiguous column names. A table has revenue, total_revenue, and net_revenue. Without context, the model picks the most plausible. With a governed definition, it picks the canonical one.
Implicit join logic. Two tables can be joined on multiple keys, each producing different cardinality. Models that infer joins from foreign keys often pick a path that compiles but quietly fans out the result.
Inconsistent definitions across tools. "Active users" in one dashboard uses a 7-day window. In another it uses 28 days. The model falls back on whichever implicit convention it absorbed from training data, which matches neither.
Deprecated tables that look canonical. A table has not been updated in months but still exists. Without lineage, the model treats it as authoritative.
Missing business rules. "Exclude internal accounts." "Use net of refunds." "Annual contracts annualize differently." These rules live in someone's head or in a wiki, not in the schema.
Permission mismatches. The agent runs with broad permissions and surfaces data the requesting user is not allowed to see. This is a row-level security failure dressed up as an accuracy problem.
Notice the pattern. None of these are model intelligence failures. They are context failures. The model writes the SQL the schema told it to write. The schema is incomplete.
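The fan-out failure above is worth seeing concretely. The toy schema below is invented for illustration: joining orders to a payments table that holds multiple rows per order duplicates each order's revenue, so the SUM compiles, runs, and quietly overstates the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, net_revenue REAL);
    CREATE TABLE payments (payment_id INTEGER, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- order 1 was paid in two installments, so it has two payment rows
    INSERT INTO payments VALUES (10, 1), (11, 1), (12, 2);
""")

# The correct total: sum revenue at the grain it is stored.
correct = conn.execute(
    "SELECT SUM(net_revenue) FROM orders").fetchone()[0]

# The fan-out: a plausible-looking join that duplicates order 1.
fanned_out = conn.execute("""
    SELECT SUM(o.net_revenue)
    FROM orders o JOIN payments p ON o.order_id = p.order_id
""").fetchone()[0]

print(correct, fanned_out)  # 150.0 250.0
```

Both queries are valid SQL and both return numbers; only one returns the answer the business meant. Nothing in the raw schema tells the model which one that is.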
How a Context Layer Fixes the Gaps
A governed context layer is the implementation pattern that consistently closes the production accuracy gap.
It does four things that raw schema cannot:
- Single source of truth definitions. Canonical metrics, dimensions, and filters live in one place. The model resolves "revenue" to the same query every time.
- Valid join paths and lineage. The context layer encodes which joins are correct, which produce fan-out, and which tables are deprecated. The model picks the right path because the wrong ones are hidden.
- Domain knowledge. Business rules from wikis, runbooks, and BI tools are ingested and indexed. The model picks up the "exclude internal accounts" rule because it appears in the context, not because someone remembered to put it in the prompt.
- Access enforcement. Row and column policies are inherited at query time. The agent can only generate SQL the requesting user is allowed to run.
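As a rough sketch, the four capabilities above might be exposed as a single payload the agent consults before writing SQL. The structure, names, and policy format are invented for illustration; real context layers expose this through APIs and protocols such as MCP, not a Python dict.

```python
# Illustrative context payload: definitions, joins, lineage
# (deprecation), and access policies in one governed place.
CONTEXT = {
    "metrics": {
        "revenue": "SUM(net_revenue) FROM analytics.orders WHERE is_test = 0",
    },
    "valid_joins": {
        ("orders", "customers"): "orders.customer_id = customers.customer_id",
    },
    "deprecated_tables": {"orders_v1"},
    "row_policies": {"analyst": "region = 'EMEA'"},  # hypothetical policy
}

def resolve(metric, table, role):
    """Resolve a metric to canonical SQL, refusing deprecated
    tables and appending the requesting role's row policy."""
    if table in CONTEXT["deprecated_tables"]:
        raise ValueError(f"{table} is deprecated; use its replacement")
    definition = CONTEXT["metrics"][metric]     # canonical, not guessed
    policy = CONTEXT["row_policies"].get(role)  # inherited at query time
    sql = f"SELECT {definition}"
    if policy:
        sql += f" AND {policy}"
    return sql

print(resolve("revenue", "orders", "analyst"))
```

The same question resolves to the same canonical definition every time, and the row policy rides along automatically instead of depending on the prompt.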
The result is the same model writing dramatically better SQL because it is no longer guessing about the parts of the question that have an authoritative answer.
Where Kaelio Fits
Text-to-SQL is one of many things a data agent does. Kaelio is not the category itself; it is the context layer that makes text-to-SQL reliable.
Kaelio auto-builds a governed context layer by connecting to your warehouses (Snowflake, BigQuery, Databricks, Redshift), transformation tools (dbt, including the dbt Semantic Layer), BI tools (Looker, Tableau, Metabase, Power BI), and documentation (Confluence, Notion, Google Docs) through 900+ connectors.
The governed context is exposed via Model Context Protocol and a REST API. Kaelio's built-in data agent (and any MCP-compatible agent such as Claude or ChatGPT) queries the context before generating SQL, inheriting canonical definitions, valid joins, lineage, and access policies. Every answer shows reasoning, lineage, and data sources, so the user can see which definitions and tables were used.
This is why teams evaluating governed AI analytics tools look beyond raw SQL-generation skill: richer context matters more than another incremental model gain.
Practical Guidance for Teams Evaluating Text-to-SQL
If you are evaluating a text-to-SQL system or AI data analyst tool, the right test is not "can it write SQL?" The right test is whether it can answer your real business questions consistently and citably.
A useful evaluation pattern:
- Prepare a question set of 30 to 50 real questions drawn from Slack, email, and analyst tickets. Include both simple lookups and questions that require multiple joins, business rules, or filters.
- Run each question three times. Measure consistency, not just one-shot accuracy. Inconsistent answers are worse than wrong-but-consistent answers because they are harder to detect.
- Compare against the answer the data team would write. "Technically valid SQL" is not the bar. "Matches the answer a senior analyst would produce" is.
- Check the citations. A correct answer without sourcing is not auditable. An incorrect answer with sourcing is at least debuggable.
- Test access enforcement. Pose the same question as a different user with narrower permissions and confirm the agent respects them.
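The run-three-times pattern above can be sketched as a small harness. The stub agent below stands in for a real text-to-SQL system; in practice each call would hit your agent's API and the golden answers would come from your data team.

```python
from collections import Counter

def evaluate(agent, question_set, runs=3):
    """Run each question several times; score both consistency
    (all runs agree) and correctness (majority answer matches
    the analyst-written golden answer)."""
    report = []
    for question, golden in question_set:
        answers = [agent(question) for _ in range(runs)]
        top, votes = Counter(answers).most_common(1)[0]
        report.append({
            "question": question,
            "consistent": votes == runs,
            "correct": top == golden,
        })
    return report

# Stub agent for illustration: deterministic on one question,
# flaky on the other to show what the harness catches.
calls = {"n": 0}
def stub_agent(question):
    if "revenue" in question:
        return 150.0
    calls["n"] += 1
    return 42 if calls["n"] % 2 else 41  # alternates between answers

report = evaluate(stub_agent, [
    ("net revenue last quarter?", 150.0),
    ("weekly active users?", 42),
])
for row in report:
    print(row)
```

The second question gets the right majority answer but fails the consistency check, which is exactly the failure mode that is hardest to catch with one-shot testing.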
For a fuller framework, see how to evaluate AI analytics tools and how to choose an analytics copilot you can actually trust.
FAQ
What is text-to-SQL?
Text-to-SQL is the task of automatically converting a natural language question into an executable SQL query. Modern text-to-SQL systems use large language models combined with schema information, in-context examples, and (when available) governed business context to generate accurate queries against real databases.
How accurate is text-to-SQL today?
On the BIRD benchmark, which evaluates text-to-SQL on 95 real databases, top systems score in the mid-70 percent range on execution accuracy. On the harder Spider 2.0 benchmark, top systems score below 60 percent. Real-world enterprise accuracy is usually lower than benchmarks because production schemas are larger, less documented, and more ambiguous than benchmark databases.
Why does text-to-SQL fail in production?
The most common failure modes are ambiguous column names, missing or implicit join logic, inconsistent metric definitions across tools, deprecated tables that look canonical, and missing business rules such as "exclude internal accounts" or "use net revenue, not gross." These are context problems, not model intelligence problems.
How does a context layer improve text-to-SQL accuracy?
A governed context layer provides the model with canonical metric definitions, valid join paths, lineage, access policies, and domain knowledge before it generates SQL. The model is constrained to the organization's actual data semantics rather than guessing from column names, which closes the gap between technically valid SQL and a correct business answer.
Is text-to-SQL ready for production analytics?
Text-to-SQL is production-ready when paired with a governed context layer, audit logging, and clear escalation paths for low-confidence answers. Without that infrastructure, accuracy and trust degrade quickly on real business questions, and the model's confidence does not reflect its actual correctness. The combination of strong models and governed context, not the model alone, is what makes text-to-SQL reliable.