Why LLMs Should Never Calculate Your Churn Rate
Intent is linguistic. Math is deterministic.
Swarnim Shrey
Founder, MindPalace
If you remember only one thing from this essay, remember the subtitle.
Every AI-native BI tool we have looked at is racing to let a large language model "talk to data." Demos look magical: ask a question in English, get a number, maybe even a chart. For a moment, it feels like we have finally killed the dashboard.
Under the hood, many of these systems are doing something irresponsible. They are letting a probabilistic language model calculate your business metrics. That is not innovation. That is a category error.
This post explains why LLMs should never compute churn, revenue, retention, or any metric you are about to make a decision on, and what a safer architecture looks like. We learned this the hard way while building MindPalace, so the examples are not hypothetical.
The seductive failure of chat-to-SQL
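Most "chat with your data" tools follow the same pattern: a question comes in as English, one model turns it into SQL, the query runs, and a number comes back.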
This works surprisingly well, for demos.
The problem is that this single-loop architecture quietly assigns three incompatible responsibilities to one model:
- Understanding intent
- Translating intent into logic
- Performing or validating the math
LLMs are excellent at the first task. They are passable at the second. They are fundamentally unqualified for the third.
Why this fails in the real world
Churn looks simple until it is not.
Ask ten companies how they calculate churn and you will get twelve answers:
- Logo churn vs revenue churn
- Gross vs net
- Monthly vs cohort-based
- Voluntary vs involuntary
- Trial users included or excluded
Each definition is contextual. Each one encodes business judgment that a Finance team made and a CRO signed off on.
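To see how far apart these definitions can land, here is a minimal sketch on a made-up four-account month. The `Account` fields and the three formulas are illustrative assumptions (one common convention each), not the definitions any particular Finance team uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    customer_id: str
    mrr_start: float  # recurring revenue at the start of the month
    mrr_end: float    # recurring revenue at the end of the month (0 = cancelled)


def logo_churn(accounts):
    """Share of accounts that cancelled outright."""
    return sum(a.mrr_end == 0 for a in accounts) / len(accounts)


def gross_revenue_churn(accounts):
    """Revenue lost to cancellations and downgrades, ignoring expansion."""
    start = sum(a.mrr_start for a in accounts)
    lost = sum(max(a.mrr_start - a.mrr_end, 0.0) for a in accounts)
    return lost / start


def net_revenue_churn(accounts):
    """Revenue lost after expansion is netted out; can go negative."""
    start = sum(a.mrr_start for a in accounts)
    end = sum(a.mrr_end for a in accounts)
    return (start - end) / start


# Hypothetical data: every account is paying at the start of the month.
ACCOUNTS = [
    Account("a", 5000.0, 5500.0),  # expanded
    Account("b", 200.0, 0.0),      # cancelled
    Account("c", 300.0, 300.0),    # unchanged
    Account("d", 500.0, 400.0),    # downgraded
]

print(f"logo churn:          {logo_churn(ACCOUNTS):.1%}")           # 25.0%
print(f"gross revenue churn: {gross_revenue_churn(ACCOUNTS):.1%}")  # 5.0%
print(f"net revenue churn:   {net_revenue_churn(ACCOUNTS):.1%}")    # -3.3%
```

Same month, same four accounts, three defensible answers. Which one "churn" means is a business decision, not something the data can tell you.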
When you ask an LLM "what was our churn last quarter?" you are not asking a math question. You are asking a specification question. And LLMs do not ask clarifying questions unless explicitly forced to. They assume.
Assumptions are poison in analytics. When we test generic chat-to-SQL tools against realistic warehouse schemas, the same pattern shows up: the SQL parses, the number comes back, the executive nods, and the answer is wrong. Worse, it is plausibly wrong. Wrong by 8 percent, not by 800. The kind of wrong that survives a glance and dies under scrutiny.
Probability vs determinism
LLMs work by predicting the most likely next token.
Statistics work by applying deterministic functions to data.
These are not adjacent disciplines. They are orthogonal. When an LLM generates SQL or computes a metric inline, you get four failure modes that all look the same from the outside:
- Silent assumption drift
- Inconsistent results across runs
- Undetectable logical errors
- Confidence without correctness
The output sounds right. That is the danger. A wrong number with high linguistic confidence is more damaging than a dashboard nobody trusts. With a distrusted dashboard, people verify. With a confident answer, they act.
The AI-native BI architecture mistake
Here is what plausibly-wrong looks like in practice. We reproduced this on our reference B2B SaaS dataset, modeled after a typical billing-and-usage warehouse. We asked a popular chat-to-SQL tool for "monthly churn." The model produced this:
    SELECT 1.0 - COUNT(DISTINCT customer_id) FILTER (
             WHERE last_active_at >= NOW() - INTERVAL '30 days'
           )::float
           / COUNT(DISTINCT customer_id)
    FROM customers

It returned 7.4 percent. The canonical definition for the same dataset, encoded in the semantic layer Finance uses, returns 6.8 percent. Off by 0.6 percentage points, about 8 percent of the reported value. Looks fine, fails review.
The query has three quiet bugs that no LLM caught:
- It treats every row in customers as the denominator. The table includes trial accounts that the canonical definition excludes.
- It uses "active in last 30 days" as the inverse of churn. The canonical version uses "billing event in current period," which is a different test.
- It ignores the is_paying flag that gates the trusted definition.
None of these are visible in the SQL. All of them are visible in the semantic layer.
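A toy reproduction of those three bugs, in Python rather than SQL so both versions can sit side by side. Everything here is an assumption made for illustration: the six rows are invented, `billed_in_period` is a hypothetical column standing in for "billing event in current period," and `governed_churn` is only a stand-in for the real semantic-layer definition, which the post does not spell out.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Customer:
    customer_id: str
    is_paying: bool
    last_active_at: date
    billed_in_period: bool  # hypothetical: had a billing event this period


def llm_style_churn(customers, today):
    """Mirrors the generated query: all rows, 30-day activity as the test."""
    cutoff = today - timedelta(days=30)
    active = sum(c.last_active_at >= cutoff for c in customers)
    return 1.0 - active / len(customers)


def governed_churn(customers):
    """Stand-in for the governed definition: paying accounts, billing events."""
    paying = [c for c in customers if c.is_paying]
    if not paying:
        raise ValueError("no paying customers in scope")
    billed = sum(c.billed_in_period for c in paying)
    return 1.0 - billed / len(paying)


TODAY = date(2024, 6, 30)
CUSTOMERS = [
    Customer("p1", True, date(2024, 6, 25), True),
    Customer("p2", True, date(2024, 6, 20), True),
    Customer("p3", True, date(2024, 4, 1), True),     # billed, but never logs in
    Customer("p4", True, date(2024, 6, 28), False),   # active, but not billed
    Customer("t1", False, date(2024, 3, 1), False),   # lapsed trial
    Customer("t2", False, date(2024, 6, 29), False),  # active trial
]

print(f"LLM-style: {llm_style_churn(CUSTOMERS, TODAY):.1%}")  # 33.3%
print(f"governed:  {governed_churn(CUSTOMERS):.1%}")          # 25.0%
```

Both functions run without error and both return a plausible percentage. Only one of them matches what the business agreed "churn" means.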
Many AI-native BI tools still collapse everything into one loop, though the better ones are starting to separate planning from execution. OpenAI's own internal data agent is one visible example of that separation done well, with a six-layer context graph for grounding and continuous evals to catch drift. Anthropic's own data agent leans even harder on a governed semantic layer where the model maps the question and a function returns the number. That separation is what we built MindPalace around from day one. The rule we started from: LLMs are planners, not calculators. That rule forced us to build a real semantic layer first, before any AI features. Cartographer exists because of that decision.
A safer split: planning vs execution
We separate the system into stages that have different jobs and different failure modes. One stage decides what to compute. One computes it. One puts it into words. Only the last stage is a language model, and it runs after the number is already final.
The plan (deterministic today)
Grounding resolves which metric definition applies: the canonical one Finance signed off on, bound to this customer's tables. The engine discovers the grain and picks the method from the shape of the data. When the question is "what changed," it z-scores the metric against a trailing baseline. When the question is "what explains the variance," it runs Kruskal-Wallis across the segments. None of this consults a language model. The plan is content-hashed, so it replays to the same SQL every time.
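For the "what changed" case, z-scoring against a trailing baseline is a few lines of deterministic code. This is a minimal sketch using only the standard library. The function name, the six-month window, the use of the sample standard deviation, and the churn series are all assumptions for illustration, not MindPalace's implementation.

```python
from statistics import mean, stdev


def trailing_zscore(series, window):
    """Z-score of the latest point against the `window` points before it."""
    if len(series) < window + 1:
        raise ValueError(f"need at least {window + 1} points, got {len(series)}")
    baseline = series[-(window + 1):-1]
    sigma = stdev(baseline)  # sample standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    return (series[-1] - mean(baseline)) / sigma


# Hypothetical monthly churn rates; the last month is the one in question.
monthly_churn = [0.060, 0.062, 0.058, 0.061, 0.059, 0.060, 0.074]
z = trailing_zscore(monthly_churn, window=6)
print(f"z = {z:.2f}")  # about 9.90
```

The same inputs always give the same z-score, and a flat or too-short baseline raises instead of returning a number.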
The natural-language front door, where you type a question in English and it becomes that plan, is the next layer we are building. Notice what it does and does not do. It reads the question and proposes a plan. It never selects the number. Even when the planner becomes a model, it stays a planner.
The analyzer (deterministic)
This engine is built on boring tools: Python, scipy, pandas, numpy. No creativity. No guessing. It runs the test, validates the assumptions, applies a Benjamini-Hochberg correction when it tests many segments at once, and fails loud when an assumption breaks.
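The Benjamini-Hochberg step is a good example of how boring this code is. Below is a minimal pure-Python sketch of the standard BH adjusted p-values, with a loud failure on invalid input. The function name and the example p-values are assumptions. A scipy-based stack like the one described here would presumably get the per-segment p-values from a routine such as `scipy.stats.kruskal` rather than by hand.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    if m == 0:
        raise ValueError("no p-values to correct")
    for p in p_values:
        # The chained comparison is False for NaN, so NaN fails loudly too.
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"invalid p-value: {p!r}")

    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four segment-level tests.
raw = [0.001, 0.02, 0.03, 0.5]
adj = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adj]
print(significant)  # [True, True, True, False]
```

Nothing here has an opinion. Given the same p-values it returns the same adjusted values, and a malformed input stops the run rather than producing a number.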
The narrator (LLM)
The only place a language model runs is here, after the numbers are final. A single small model call turns the finished result into plain English: the headline, the takeaways, the recommendations. It can only describe numbers that already exist. When the data quality is too poor to narrate honestly, even this step drops to a deterministic template. The model is a narrator, not a calculator and not a planner.
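One way to enforce "it can only describe numbers that already exist" is a guard between the model's draft and the user. The sketch below is an assumption about how such a guard could look, not a description of MindPalace's narrator: it rejects any draft containing a number that is not in the finished result, and falls back to a deterministic template. The `result` keys and the template wording are hypothetical.

```python
import re

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text):
    """All numeric literals in the text, as strings."""
    return set(_NUMBER.findall(text))


def safe_narration(result, draft, quality_ok):
    """Return the draft only if every number in it comes from the result."""
    template = f"{result['metric']} is {result['value']} for {result['period']}."
    if not quality_ok:
        return template  # data too poor to narrate: skip the model entirely
    allowed = numbers_in(" ".join(str(v) for v in result.values()))
    if numbers_in(draft) - allowed:
        return template  # the draft introduced a number that was never computed
    return draft


result = {"metric": "Monthly logo churn", "value": "6.8%", "period": "last month"}
good = "Monthly logo churn came in at 6.8% last month."
bad = "Monthly logo churn came in at 7.4%, up from 6.8%."

print(safe_narration(result, good, quality_ok=True))
print(safe_narration(result, bad, quality_ok=True))
```

The string match is deliberately strict, so a draft that writes "6.80%" would also fall back to the template. For a narrator, a false rejection is cheaper than an invented number.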
Why we do not let LLMs "fix" the numbers
Some systems try to be clever:
- If SQL errors, ask the LLM to fix it
- If numbers look off, ask the LLM to adjust
- If results conflict, ask the LLM to reconcile
This creates a feedback loop where the model optimizes for plausibility, not truth.
Our analyzer is intentionally dumb:
- If assumptions are violated, it fails
- If data is insufficient, it stops
- If a metric's definition is ambiguous, it does not pick one. The conflict gets flagged for human validation before any number lands in front of a user.
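The third rule is the easiest to sketch. Assuming a hypothetical registry that maps a business term to the metric definitions it could mean, resolution either returns exactly one definition or raises. The registry contents, the function name, and the exception type are all illustrative.

```python
class AmbiguousMetricError(Exception):
    """Raised when a term maps to more than one governed definition."""


# Hypothetical registry: business term -> candidate metric definitions.
REGISTRY = {
    "churn": ["monthly_logo_churn", "gross_revenue_churn"],
    "mrr": ["mrr_canonical"],
}


def resolve_metric(term, registry):
    """Return the single definition for a term, or fail loudly."""
    candidates = registry.get(term.strip().lower(), [])
    if not candidates:
        raise KeyError(f"no governed definition for {term!r}")
    if len(candidates) > 1:
        raise AmbiguousMetricError(
            f"{term!r} could mean any of {candidates}; needs human validation"
        )
    return candidates[0]


print(resolve_metric("MRR", REGISTRY))  # mrr_canonical

try:
    resolve_metric("churn", REGISTRY)
except AmbiguousMetricError as err:
    print(f"flagged: {err}")
```

There is no branch that guesses. The only two outcomes are one definition or an error a human has to clear.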
Failure is a feature. Silent correction is not.
Below is the shape of what we show. The plan, the assumption checks, and the test statistics all appear together so the audit trail is visible at the same time as the answer.
Deep Analysis
Is monthly churn higher in the SMB segment?
Plan
- Metric: Monthly Logo Churn
- Population: SMB vs Mid-Market vs Enterprise
- Window: Last 6 calendar months
- Test: Kruskal-Wallis
Assumption checks
- Sample adequacy per group
- Comparable distribution shape across groups
- Multiple-comparison correction (Benjamini-Hochberg)
- Independence of observations
Test statistics
- H-statistic: 16.31
- p-value: 0.0008
- Effect size (η²): 0.41
- Sample size: 4,228
Illustrative output rendered from our reference SaaS dataset. In production, every result links to its full audit trail: the SQL that ran, the row counts at each step, and each of the assumption checks above with their pass conditions.
Why this matters for trust
Executives do not distrust data because they hate numbers. They distrust data because:
- Numbers change without explanation
- Metrics disagree across tools
- Nobody can trace why a value exists
Letting an LLM compute metrics accelerates all three problems. Separating intent from execution reverses them. Every result has a plan. Every plan has assumptions. Every assumption can be inspected. That is how trust gets built. Not with confidence, with traceability.
The real role of AI in analytics
AI should not replace your math. It should replace the coordination cost of analysis. Coordination cost is the time spent translating between business question, semantic definition, SQL, statistical method, and result. That is what AI compresses. The math itself was always the easy part.
In MindPalace specifically, the plan is a small content-hashed document produced for every analysis: the metric being asked about (resolved against the customer's semantic layer), the population in scope, the method chosen for the shape of the data (z-scores against a trailing baseline, or Kruskal-Wallis across segments with Mann-Whitney post-hoc when we rank drivers), and the assumptions that have to hold. None of that is a language model. Deep Analysis takes the plan, runs the math in Python with scipy and pandas, applies a Benjamini-Hochberg correction when it tests many segments at once, and returns a result with the assumption checks attached. The one place a model appears is the last step, where a single small model call turns the finished result into a plain-English summary. We keep the model out of the planning and the math on purpose.
That split lets us do four things at once:
- Pin every question to one precise, replayable plan
- Remember how each metric is defined in the company's semantic layer
- Pick the right statistical method for the shape of the data
- Show why a number exists, not just what it is
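The first item, a replayable plan, rests on content hashing. A minimal sketch, assuming the plan is a JSON-serializable document: serialize it canonically, hash the bytes, and use the digest as the plan's identity. The field names below are borrowed from the Deep Analysis card for illustration; the real plan schema is not shown in this post.

```python
import hashlib
import json


def plan_hash(plan):
    """SHA-256 over a canonical JSON encoding of the plan."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical plan documents.
plan_a = {
    "metric": "monthly_logo_churn",
    "population": ["SMB", "Mid-Market", "Enterprise"],
    "window_months": 6,
    "test": "kruskal_wallis",
}
# Same content, different key order: must hash identically.
plan_b = {
    "test": "kruskal_wallis",
    "window_months": 6,
    "population": ["SMB", "Mid-Market", "Enterprise"],
    "metric": "monthly_logo_churn",
}
# One field changed: must hash differently.
plan_c = dict(plan_a, window_months=12)

print(plan_hash(plan_a) == plan_hash(plan_b))  # True
print(plan_hash(plan_a) == plan_hash(plan_c))  # False
```

Two questions that resolve to the same plan get the same hash and the same result. Any change to the metric, population, window, or test is a different plan, and it shows.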
We have not seen an AI-native BI system work in production any other way. The ones that fail tend to fail the same way: a single LLM doing too many jobs at once, with nothing to catch it when it drifts.
A simple rule of thumb
If the output is a sentence, an LLM is appropriate.
If the output is a number you will make a decision on, an LLM should not touch it.
Intent is linguistic. Math is deterministic. Design your systems accordingly.
If you want to see how the planner-and-analyzer split actually runs in production, take a look at the product. If you want the long version of how we built the grounding layer underneath it, read "What is a Decision Context Graph?". If your team is currently the Human API between executives and the warehouse, that is the problem we are trying to dissolve.