AI Trust

Why LLMs Should Never Calculate Your Churn Rate

Intent is linguistic. Math is deterministic.

Swarnim Shrey

Founder, MindPalace

April 19, 2026 · 9 min read

If you remember only one thing from this essay, remember the subtitle.

Every AI-native BI tool we have looked at is racing to let a large language model "talk to data." Demos look magical: ask a question in English, get a number, maybe even a chart. For a moment, it feels like we have finally killed the dashboard.

Under the hood, many of these systems are doing something irresponsible. They are letting a probabilistic language model calculate your business metrics. That is not innovation. That is a category error.

This post explains why LLMs should never compute churn, revenue, retention, or any metric you are about to make a decision on, and what a safer architecture looks like. We learned this the hard way while building MindPalace, so the examples are not hypothetical.

The seductive failure of chat-to-SQL

Most "chat with your data" tools follow the same pattern.

Chat-to-SQL in one loop. The LLM does interpretation, translation, and math validation. Three jobs, one model, one weakest link.

This works surprisingly well, for demos.

The problem is that this single-loop architecture quietly assigns three incompatible responsibilities to one model:

Understanding intent
Translating intent into logic
Performing or validating the math

LLMs are excellent at the first task. They are passable at the second. They are fundamentally unqualified for the third.

Why this fails in the real world

Churn looks simple until it is not.

Ask ten companies how they calculate churn and you will get twelve answers:

Logo churn vs revenue churn
Gross vs net
Monthly vs cohort-based
Voluntary vs involuntary
Trial users included or excluded

Each definition is contextual. Each one encodes business judgment that a Finance team made and a CRO signed off on.
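To make the gap between definitions concrete, here is a minimal sketch comparing logo churn and gross revenue churn on the same period. The account data, field names, and function names are all hypothetical and chosen only to show how far two reasonable definitions can diverge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    name: str
    mrr_start: float  # monthly recurring revenue at the start of the period
    churned: bool     # True if the account cancelled during the period


def logo_churn(accounts):
    """Share of accounts that cancelled, regardless of their size."""
    return sum(a.churned for a in accounts) / len(accounts)


def gross_revenue_churn(accounts):
    """Share of starting MRR lost to cancellations (expansion is ignored)."""
    lost = sum(a.mrr_start for a in accounts if a.churned)
    return lost / sum(a.mrr_start for a in accounts)


# Hypothetical book of business: four small accounts and one large one.
accounts = [
    Account("a", 100.0, True),
    Account("b", 100.0, True),
    Account("c", 100.0, False),
    Account("d", 100.0, False),
    Account("e", 1600.0, False),
]

print(f"logo churn:    {logo_churn(accounts):.1%}")           # 2 of 5 accounts
print(f"revenue churn: {gross_revenue_churn(accounts):.1%}")  # 200 of 2000 MRR
```

Same customers, same period, and the two answers differ by a factor of four. Neither is wrong. They answer different questions, which is why the definition has to be chosen before any arithmetic runs.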

When you ask an LLM "what is our churn last quarter?" you are not asking a math question. You are asking a specification question. And LLMs do not ask clarifying questions unless explicitly forced to. They assume.

Assumptions are poison in analytics. When we test generic chat-to-SQL tools against realistic warehouse schemas, the same pattern shows up: the SQL parses, the number comes back, the executive nods, and the answer is wrong. Worse, it is plausibly wrong. Wrong by 8 percent, not by 800. The kind of wrong that survives a glance and dies under scrutiny.

Probability vs determinism

LLMs work by predicting a probability distribution over the next token and sampling from it.

Statistics work by applying deterministic functions to data.

These are not adjacent disciplines. They are orthogonal. When an LLM generates SQL or computes a metric inline, you get four failure modes that all look the same from the outside:

Silent assumption drift
Inconsistent results across runs
Undetectable logical errors
Confidence without correctness

The output sounds right. That is the danger. A wrong number with high linguistic confidence is more damaging than a dashboard nobody trusts. With a distrusted dashboard, people verify. With a confident answer, they act.

The AI-native BI architecture mistake

Here is what plausibly-wrong looks like in practice. We reproduced this on our reference B2B SaaS dataset, modeled after a typical billing-and-usage warehouse. We asked a popular chat-to-SQL tool for "monthly churn." The model produced this:

SELECT 1.0 - COUNT(DISTINCT customer_id) FILTER (
    WHERE last_active_at >= NOW() - INTERVAL '30 days'
  )::float
  / COUNT(DISTINCT customer_id)
FROM customers

It returned 7.4 percent. The canonical definition for the same dataset, encoded in the semantic layer Finance uses, returns 6.8 percent. That is a gap of 0.6 percentage points, about 8 percent of the reported value. Looks fine, fails review.

The query has three quiet bugs that no LLM caught. First, it treats every row in customers as the denominator, but that table includes trial accounts the canonical definition excludes. Second, it uses "active in last 30 days" as the inverse of churn, while the canonical version uses "billing event in current period," which is a different test. Third, it ignores the is_paying flag that gates the trusted definition. None of these are visible in the SQL. All of them are visible in the semantic layer.
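A toy reconstruction shows how those three choices move the number. This is a sketch under stated assumptions: the rows are invented, every field except is_paying is a hypothetical name, and canonical_churn is only our guess at the shape of a definition that excludes trials, respects is_paying, and keys on billing events. It is not the actual semantic-layer definition.

```python
from typing import NamedTuple


class Customer(NamedTuple):
    customer_id: int
    is_trial: bool
    is_paying: bool            # assumed to mean "paying at the start of the period"
    active_last_30d: bool
    billed_this_period: bool


def naive_churn(rows):
    """Mirrors the generated SQL: 1 - recently active / every row in the table."""
    active = sum(r.active_last_30d for r in rows)
    return 1.0 - active / len(rows)


def canonical_churn(rows):
    """Assumed canonical shape: paying, non-trial base; retained = billed this period."""
    base = [r for r in rows if r.is_paying and not r.is_trial]
    if not base:
        raise ValueError("no paying customers in scope")
    retained = sum(r.billed_this_period for r in base)
    return 1.0 - retained / len(base)


# Hypothetical rows. 6 pays but never logs in, 7 logs in but stopped paying,
# 9 and 10 are trial accounts.
customers = [
    Customer(1, False, True, True, True),
    Customer(2, False, True, True, True),
    Customer(3, False, True, True, True),
    Customer(4, False, True, True, True),
    Customer(5, False, True, True, True),
    Customer(6, False, True, False, True),
    Customer(7, False, True, True, False),
    Customer(8, False, True, False, False),
    Customer(9, True, False, False, False),
    Customer(10, True, False, True, False),
]

print(f"naive:     {naive_churn(customers):.1%}")      # 3 of 10 rows not active
print(f"canonical: {canonical_churn(customers):.1%}")  # 2 of 8 paying not billed
```

Both functions run without error and both return a believable percentage. Only the second one matches what the business agreed to measure.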

Many AI-native BI tools still collapse everything into one loop, though the better ones are starting to separate planning from execution. OpenAI's own internal data agent is one visible example of that separation done well, with a six-layer context graph for grounding and continuous evals to catch drift. Anthropic's own data agent leans even harder on a governed semantic layer where the model maps the question and a function returns the number. That separation is what we built MindPalace around from day one. LLMs are planners, not calculators. That rule forced us to build a real semantic layer first, before any AI features. Cartographer exists because of that decision.

A safer split: planning vs execution

We separate the system into stages that have different jobs and different failure modes. One stage decides what to compute. One computes it. One puts it into words. Only the last stage is a language model, and it runs after the number is already final.

The plan is the hand-off contract. The LLM cannot touch the math. The analyzer cannot reinterpret intent.

The plan (deterministic today)

Grounding resolves which metric definition applies: the canonical one Finance signed off on, bound to this customer's tables. The engine discovers the grain and picks the method from the shape of the data. When the question is "what changed," it z-scores the metric against a trailing baseline. When the question is "what explains the variance," it runs Kruskal-Wallis across the segments. None of this consults a language model. The plan is content-hashed, so it replays to the same SQL every time.
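One way to content-hash a plan is to serialize it canonically and hash the bytes. The sketch below assumes the plan is a JSON-serializable dictionary. The field names and the choice of SHA-256 are illustrative assumptions, not a description of MindPalace's actual plan format.

```python
import hashlib
import json


def plan_hash(plan: dict) -> str:
    """Hash a plan so that logically identical plans get identical IDs."""
    # sort_keys and fixed separators remove key-order and whitespace differences.
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical plan fields, for illustration only.
plan_a = {
    "metric": "monthly_logo_churn",
    "population": ["SMB", "Mid-Market", "Enterprise"],
    "window": {"unit": "calendar_month", "count": 6},
    "test": "kruskal_wallis",
}

# Same content, different key order: must hash to the same value.
plan_b = {
    "test": "kruskal_wallis",
    "window": {"count": 6, "unit": "calendar_month"},
    "population": ["SMB", "Mid-Market", "Enterprise"],
    "metric": "monthly_logo_churn",
}

# Any change to the plan's content yields a different hash.
plan_c = dict(plan_a, window={"unit": "calendar_month", "count": 3})

print(plan_hash(plan_a) == plan_hash(plan_b))  # same plan, same hash
print(plan_hash(plan_a) == plan_hash(plan_c))  # different window, different hash
```

The hash gives every result a stable identifier for the exact plan that produced it, so a replay can be checked against the original instead of being trusted.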

The natural-language front door, where you type a question in English and it becomes that plan, is the next layer we are building. Notice what it does and does not do. It reads the question and proposes a plan. It never selects the number. Even when the planner becomes a model, it stays a planner.

The analyzer (deterministic)

This engine is built on boring tools: Python, scipy, pandas, numpy. No creativity. No guessing. It runs the test, validates the assumptions, applies a Benjamini-Hochberg correction when it tests many segments at once, and fails loud when an assumption breaks.
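The Benjamini-Hochberg step and the fail-loud behavior can be sketched in a few lines. This version is hand-rolled in pure Python so that it is self-contained. The exception name, the helper names, and the minimum group size of 5 are assumptions made for illustration. In the stack described above, the per-segment p-values would come from a scipy test rather than being typed in.

```python
class AssumptionViolation(Exception):
    """Raised instead of returning a number when a precondition fails."""


def require_min_group_size(groups, minimum=5):
    """Fail loud if any segment is too small to test (threshold is hypothetical)."""
    for name, values in groups.items():
        if len(values) < minimum:
            raise AssumptionViolation(
                f"group {name!r} has {len(values)} rows, needs {minimum}"
            )


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    if m == 0:
        raise AssumptionViolation("no p-values to correct")
    for p in p_values:
        if not (0.0 <= p <= 1.0):  # this comparison also rejects NaN
            raise AssumptionViolation(f"invalid p-value: {p!r}")
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical per-segment p-values
print(benjamini_hochberg(raw))

try:
    require_min_group_size({"SMB": [1, 0, 1], "Enterprise": [0] * 40})
except AssumptionViolation as err:
    print("stopped:", err)
```

Note what the code does not do: it never clamps a bad value, drops a small group, or retries with a different method. Each of those would be a silent correction.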

The narrator (LLM)

The only place a language model runs is here, after the numbers are final. A single small model call turns the finished result into plain English: the headline, the takeaways, the recommendations. It can only describe numbers that already exist. When the data quality is too poor to narrate honestly, even this step drops to a deterministic template. The model is a narrator, not a calculator and not a planner.
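One possible way to enforce "can only describe numbers that already exist" is to check the narrator's draft against the finished result and fall back to a template when it cites anything else. The sketch below is an assumption about how such a guard could look, not a description of the production narrator. The result fields, function names, and regex are hypothetical, and the number matching is deliberately naive (it does not handle thousands separators or scientific notation).

```python
import re

# Matches plain integers and decimals such as 16.31 or 4228.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def template_summary(result):
    """Deterministic fallback: no model involved, only fields from the result."""
    return (
        f"{result['metric']}: H = {result['h_statistic']}, "
        f"p = {result['p_value']}, n = {result['sample_size']}."
    )


def safe_narration(draft, result):
    """Accept the model's draft only if every number in it exists in the result."""
    allowed = {str(v) for v in result.values() if isinstance(v, (int, float))}
    cited = set(NUMBER.findall(draft))
    if cited <= allowed:
        return draft
    return template_summary(result)


result = {
    "metric": "Monthly Logo Churn",
    "h_statistic": 16.31,
    "p_value": 0.0008,
    "sample_size": 4228,
}

good = "Churn differs across segments (H = 16.31, p = 0.0008)."
bad = "Churn rose 12 percent across segments (p = 0.0008)."

print(safe_narration(good, result))  # draft passes through unchanged
print(safe_narration(bad, result))   # "12" is not in the result, so the template is used
```

The guard is strict on purpose. A draft that mentions any figure the analyzer did not produce is discarded rather than repaired.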

Inside the analyzer. Failure is loud. Silent correction is not allowed.

Why we do not let LLMs "fix" the numbers

Some systems try to be clever:

If SQL errors, ask the LLM to fix it
If numbers look off, ask the LLM to adjust
If results conflict, ask the LLM to reconcile

This creates a feedback loop where the model optimizes for plausibility, not truth.

Our analyzer is intentionally dumb:

If assumptions are violated, it fails
If data is insufficient, it stops
If a metric's definition is ambiguous, it does not pick one. The conflict gets flagged for human validation before any number lands in front of a user.
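The ambiguity rule in the last bullet can be sketched as a resolver that refuses to choose. The definition records, field names, and exception names below are hypothetical. The point is only the control flow: zero matches and multiple matches both raise, and exactly one match is the only path that returns.

```python
class AmbiguousMetricError(Exception):
    """More than one approved definition matches; a human must choose."""


class UnknownMetricError(Exception):
    """No approved definition matches the requested metric."""


def resolve_metric(name, definitions):
    """Return the single approved definition for a metric, or fail loud."""
    matches = [d for d in definitions if d["name"] == name and d["approved"]]
    if not matches:
        raise UnknownMetricError(name)
    if len(matches) > 1:
        variants = ", ".join(d["variant"] for d in matches)
        raise AmbiguousMetricError(f"{name!r} has competing definitions: {variants}")
    return matches[0]


# Hypothetical semantic-layer entries.
definitions = [
    {"name": "churn", "variant": "logo", "approved": True},
    {"name": "churn", "variant": "gross_revenue", "approved": True},
    {"name": "mrr", "variant": "contracted", "approved": True},
]

print(resolve_metric("mrr", definitions)["variant"])

try:
    resolve_metric("churn", definitions)
except AmbiguousMetricError as err:
    print("flagged for human validation:", err)
```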

Failure is a feature. Silent correction is not.

Below is a real shape of what we show. The plan, the assumption checks, and the test statistics all appear together so the audit trail is visible at the same time as the answer.

Deep Analysis

Is monthly churn higher in the SMB segment?

Significant across segments

Plan

Metric: Monthly Logo Churn
Population: SMB vs Mid-Market vs Enterprise
Window: Last 6 calendar months
Test: Kruskal-Wallis

Assumption checks

Sample adequacy per group
Comparable distribution shape across groups
Multiple-comparison correction (Benjamini-Hochberg)
Independence of observations

H-statistic: 16.31
p-value: 0.0008
Effect size (η²): 0.41
Sample size: 4,228

Illustrative output rendered from our reference SaaS dataset. In production, every result links to its full audit trail: the SQL that ran, the row counts at each step, and each of the assumption checks above with their pass conditions.

If you cannot trace why a number exists, we did not earn the right to show it.

Why this matters for trust

Executives do not distrust data because they hate numbers. They distrust data because:

Numbers change without explanation
Metrics disagree across tools
Nobody can trace why a value exists

Letting an LLM compute metrics accelerates all three problems. Separating intent from execution reverses them. Every result has a plan. Every plan has assumptions. Every assumption can be inspected. That is how trust gets built. Not with confidence, with traceability.

The real role of AI in analytics

AI should not replace your math. It should replace the coordination cost of analysis. Coordination cost is the time spent translating between business question, semantic definition, SQL, statistical method, and result. That is what AI compresses. The math itself was always the easy part.

In MindPalace specifically, the plan is a small content-hashed document produced for every analysis: the metric being asked about (resolved against the customer's semantic layer), the population in scope, the method chosen for the shape of the data (z-scores against a trailing baseline, or Kruskal-Wallis across segments with Mann-Whitney post-hoc when we rank drivers), and the assumptions that have to hold. None of that is a language model. Deep Analysis takes the plan, runs the math in Python with scipy and pandas, applies a Benjamini-Hochberg correction when it tests many segments at once, and returns a result with the assumption checks attached. The one place a model appears is the last step, where a single small model call turns the finished result into a plain-English summary. We keep the model out of the planning and the math on purpose.

That split lets us do four things at once:

Pin every question to one precise, replayable plan
Remember how each metric is defined in the company's semantic layer
Pick the right statistical method for the shape of the data
Show why a number exists, not just what it is
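As an illustration of the simplest method named above, z-scoring a metric against a trailing baseline, here is a minimal sketch using only the standard library. The function name, the minimum of six baseline points, and the sample values are assumptions for illustration.

```python
from statistics import mean, stdev


def trailing_z_score(history, latest, min_points=6):
    """How many baseline standard deviations `latest` sits from the baseline mean."""
    if len(history) < min_points:
        raise ValueError(
            f"need at least {min_points} baseline points, got {len(history)}"
        )
    spread = stdev(history)  # sample standard deviation of the trailing window
    if spread == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    return (latest - mean(history)) / spread


# Hypothetical monthly counts of churned accounts for the trailing six months.
history = [10, 12, 11, 13, 9, 11]
latest = 15

z = trailing_z_score(history, latest)
print(f"z = {z:.2f}")  # baseline mean 11, sample stdev sqrt(2), so z is about 2.83
```

The function is the same every time it runs on the same inputs, and it stops instead of guessing when the baseline is too short or flat. That is the property the plan relies on when it pins a question to a method.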

We have not seen an AI-native BI system work in production any other way. The ones that fail tend to fail the same way: a single LLM doing too many jobs at once, with nothing to catch it when it drifts.

A simple rule of thumb

If the output is a sentence, an LLM is appropriate.

If the output is a number you will make a decision on, an LLM should not touch it.

Intent is linguistic. Math is deterministic. Design your systems accordingly.

If you want to see how the planner-and-analyzer split actually runs in production, take a look at the product. If you want the long version of how we built the grounding layer underneath it, read "What is a Decision Context Graph." If your team is currently the Human API between executives and the warehouse, that is the problem we are trying to dissolve.