Concerns and Caveats

Concerns, Caveats & What This Data Can't Tell Us

Good analysis requires intellectual honesty. This page documents the limitations of the data, the known gaps in our analysis, and the places where reasonable people might draw different conclusions. If you're using any findings from this dashboard to inform decisions, start here.

Provenance

Methodological Guardrails

Default interpretation rule across this project: claims aggregates are descriptive unless explicitly marked otherwise. Causal claims require an additional identification strategy, external controls, and outcome validation.

Data Quality Notes

**November-December 2024 data is incomplete.** Medicaid claims take weeks to months to adjudicate. We filter all analysis to October 2024 and earlier to avoid undercounting. Any apparent "decline" in late 2024 is almost certainly a data lag, not a real trend.
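The cutoff described above can be sketched as a simple month filter. This is an illustrative sketch only: the row layout and the names `CUTOFF` and `is_complete_month` are assumptions, not the dashboard's actual query.

```python
from datetime import date

# Last service month treated as complete (October 2024, per the note above).
CUTOFF = date(2024, 10, 1)


def is_complete_month(service_month: date, cutoff: date = CUTOFF) -> bool:
    """Return True if the service month is on or before the cutoff month."""
    return (service_month.year, service_month.month) <= (cutoff.year, cutoff.month)


# Hypothetical rows: (service month, paid amount in dollars).
rows = [
    (date(2024, 9, 1), 120.0),
    (date(2024, 10, 1), 95.0),
    (date(2024, 11, 1), 40.0),  # lagging month, excluded
    (date(2024, 12, 1), 12.0),  # lagging month, excluded
]

kept = [(month, paid) for month, paid in rows if is_complete_month(month)]
total_paid = sum(paid for _, paid in kept)
```

Filtering on the service month, rather than on the date a claim was paid, keeps the cutoff aligned with when care was billed as delivered.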

**9.5 million rows (4.2% of the dataset) have null servicing provider NPIs.** These rows still have billing provider NPIs and are included in spending totals, but any analysis that depends on identifying the actual practitioner who delivered care is missing these records.

**Small-cell suppression is in effect.** Rows with fewer than 12 claims are suppressed in the source data to protect patient privacy. This means we systematically undercount the smallest providers and the rarest procedures. The long tail of Medicaid — rural solo practitioners, uncommon diagnoses — is partially invisible.
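A minimal sketch of how the suppression rule produces an undercount, using made-up rows. The provider labels and claim counts are hypothetical; only the threshold of 12 comes from the text above. In the real source data the suppressed rows are already gone, so this share cannot be measured directly.

```python
# Rows with fewer than this many claims are suppressed in the source data.
SUPPRESSION_THRESHOLD = 12

# Hypothetical (provider label, claim count) rows before suppression.
rows = [
    ("npi-large-clinic", 480),
    ("npi-mid-practice", 35),
    ("npi-rural-solo", 11),      # suppressed: below threshold
    ("npi-rare-procedure", 3),   # suppressed: below threshold
]

visible = [(npi, n) for npi, n in rows if n >= SUPPRESSION_THRESHOLD]

true_total = sum(n for _, n in rows)
visible_total = sum(n for _, n in visible)

# Share of claims that disappear because their rows are suppressed.
hidden_share = (true_total - visible_total) / true_total
```

In this toy example half of the providers vanish while only a small share of claims does, which is why suppression hurts counts of small providers more than spending totals.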

**This is claims data, not a record of completed care.** A "claim" is a billing event. It can be denied, adjusted, reversed, or resubmitted (denied claims do not appear in this dataset; see No Denial Data below). The "paid amount" field represents what was actually paid, but the claim counts and beneficiary counts reflect billing activity, not necessarily completed care episodes.

What This Data Doesn't Have

No dataset tells the whole story. Here's what's missing from this one:

No Patient Demographics

This dataset contains no information about age, race, gender, ethnicity, disability status, or any other demographic characteristic of beneficiaries. We cannot analyze disparities, identify underserved populations, or determine whether specific groups are being left behind. Given that Medicaid disproportionately serves communities of color, people with disabilities, and children, this is a significant blind spot.

Limited Geographic Resolution

The dataset includes provider NPIs but not provider locations. We join against the CMS NPI Registry (NPPES) to map each provider to a state based on its registered practice address. This enables state-level analysis (see the Geographic Analysis page), but it has two limitations. First, the registered address is not necessarily where care is delivered (for example, with telehealth). Second, we cannot do sub-state analysis without an additional ZIP-to-county mapping.

No Outcomes

We can see how much was spent and on what. We cannot see whether anyone got better. Claims data tells you that a prescription was filled — not whether the patient took the medication, whether it worked, or whether they're still alive. Spending is a poor proxy for health.

No Denial Data

This dataset shows claims that were paid. It does not show claims that were submitted and denied, and it does not show prior-authorization requests that were refused. Denial rates are one of the most important access metrics, and we can't see them.

No Managed Care Capitation Payments

Medicaid is increasingly delivered through managed care organizations (MCOs) that receive capitated (per-member-per-month) payments. This dataset captures fee-for-service claims and some MCO encounter data, but it does not capture the full picture of managed care spending. States with heavy MCO penetration may appear to have lower spending in this data than they actually have.

Concentration Risks

Market concentration matters for access. When a small number of providers control a large share of spending in a service category, the exit of even one provider can create a crisis.

SQL for the `concentration` query:

```sql
SELECT * FROM medicaid.concentration_by_category
```

Citation: concentration (source medicaid.concentration_by_category).

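**How to read HHI:** The Herfindahl-Hirschman Index measures market concentration as the sum of squared market shares, with shares expressed in percentage points, so it ranges from near 0 (many small providers) to 10,000 (a single provider). The 2010 DOJ/FTC Horizontal Merger Guidelines treated markets with HHI above 2,500 as "highly concentrated"; the 2023 Merger Guidelines lowered that threshold to 1,800. At the broad category level, Medicaid markets appear unconcentrated, but this can mask extreme concentration at the procedure + geography level. A state with only 3 home health agencies serving Medicaid patients is concentrated even if the national HHI is low.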

The Top Provider Share column shows what percentage of total spending in each category flows through a single billing NPI. Even modest-looking percentages can represent billions of dollars and hundreds of thousands of beneficiaries. If that provider leaves the Medicaid program, those beneficiaries don't automatically find new care.
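The two concentration measures discussed above can be sketched as follows. This is not the query behind `medicaid.concentration_by_category`; the function names and the spending figures are hypothetical, and the sketch uses the standard HHI definition (sum of squared percentage shares).

```python
from typing import Mapping


def hhi(spending_by_provider: Mapping[str, float]) -> float:
    """Sum of squared market shares, with shares in percentage points (max 10,000)."""
    total = sum(spending_by_provider.values())
    if total <= 0:
        raise ValueError("total spending must be positive")
    return sum((100.0 * v / total) ** 2 for v in spending_by_provider.values())


def top_provider_share(spending_by_provider: Mapping[str, float]) -> float:
    """Percentage of total spending flowing through the single largest billing NPI."""
    total = sum(spending_by_provider.values())
    if total <= 0:
        raise ValueError("total spending must be positive")
    return 100.0 * max(spending_by_provider.values()) / total


# Hypothetical national market: 100 equally sized providers.
national = {f"npi-{i}": 1_000_000.0 for i in range(100)}

# Hypothetical state with only three home health agencies (50% / 30% / 20%).
one_state = {"npi-a": 5_000_000.0, "npi-b": 3_000_000.0, "npi-c": 2_000_000.0}

national_hhi = hhi(national)            # low: many equal players
state_hhi = hhi(one_state)              # high: three players
state_top_share = top_provider_share(one_state)
```

The same function gives a low index for the broad market and a high one for the three-agency state, which is the masking effect described above.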

Known Limitations of Our Analysis

Beyond the data itself, our analytical choices introduce additional caveats:

HCPCS Category Mapping Is Approximate

We assign HCPCS procedure codes to categories (Mental Health, Surgery, Lab, etc.) using code range patterns. This is a reasonable heuristic but not a precise classification. Some codes fall at the boundary between categories, and the "Other" category is a catch-all for everything that doesn't match our patterns. Different analysts might draw these boundaries differently and get somewhat different results.
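A rough sketch of what a range-based mapping can look like. The ranges and prefixes below are illustrative assumptions, not the patterns this dashboard actually uses, and `categorize` is a hypothetical helper name.

```python
# Illustrative numeric (CPT-style) ranges: (low, high, category). Assumed, not the dashboard's.
NUMERIC_RANGES = [
    (10004, 69990, "Surgery"),
    (80047, 89398, "Lab"),
    (90785, 90899, "Mental Health"),
]

# Illustrative letter prefixes for alphanumeric HCPCS Level II codes. Also assumed.
LETTER_PREFIXES = {"H": "Mental Health"}


def categorize(code: str) -> str:
    """Map a 5-character HCPCS code to a category, falling back to 'Other'."""
    code = code.strip().upper()
    if len(code) != 5 or not code.isascii():
        return "Other"
    if code.isdigit():
        number = int(code)
        for low, high, label in NUMERIC_RANGES:
            if low <= number <= high:
                return label
        return "Other"
    if code[0].isalpha() and code[1:].isdigit():
        return LETTER_PREFIXES.get(code[0], "Other")
    return "Other"


examples = {c: categorize(c) for c in ["90834", "80053", "27447", "99213", "H0004", "T1019"]}
```

Under this toy mapping an office-visit code such as 99213 lands in "Other" regardless of who billed it, which illustrates how boundary choices shift category totals.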

Beneficiary Counts Are Not Unique Individuals

When we say "beneficiaries" in this analysis, we mean the sum of `TOTAL_UNIQUE_BENEFICIARIES` across rows. A single person who sees three different providers in the same month appears three times, once in each provider's row. Because rows are at the provider-procedure-month level, the same person is also counted again for each additional procedure code and each additional month. This means our beneficiary counts overstate the number of unique people. The overcount is especially significant in categories where patients see multiple providers (like mental health, where a patient might see a therapist, a psychiatrist, and a case manager).
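A small sketch of why summing per-row unique counts overstates unique people. The person-level records here are hypothetical; the actual dataset contains only the per-row counts, not beneficiary identifiers, so the true figure cannot be recovered from it.

```python
# Hypothetical person-level visits: (beneficiary id, provider NPI, month).
visits = [
    ("bene-1", "npi-therapist", "2023-05"),
    ("bene-1", "npi-psychiatrist", "2023-05"),
    ("bene-1", "npi-case-manager", "2023-05"),
    ("bene-2", "npi-therapist", "2023-05"),
]

# Build the per-row unique counts the way an aggregated dataset would report them.
unique_per_row = {}
for bene, npi, month in visits:
    unique_per_row.setdefault((npi, month), set()).add(bene)

# What this dashboard reports: the sum of per-row unique counts.
summed_beneficiaries = sum(len(people) for people in unique_per_row.values())

# What we would want but cannot compute from aggregated rows: distinct people.
true_unique_people = len({bene for bene, _, _ in visits})

overcount_factor = summed_beneficiaries / true_unique_people
```

Here two people show up as four "beneficiaries", because one of them sees three providers in the same month.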

Spending Means Paid Amounts, Not Charges or Costs

The "spending" figures throughout this dashboard represent amounts actually paid by Medicaid — not what providers charged (which is typically much higher) or what the care actually cost to deliver (which is different still). A provider might charge $200, receive $45 from Medicaid, and incur $60 in costs. We only see the $45.

Time Period Constraints

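Our analysis covers January 2018 through October 2024; November and December 2024 are excluded as incomplete (see Data Quality Notes). We cannot observe trends from before 2018, and the most recent retained months may still be affected by claims lag. Year-over-year comparisons that include 2024 should be interpreted cautiously, because the 2024 figures cover only 10 months: compare January-October 2024 with January-October of the prior year, not with a full prior year.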
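One way to keep 2024 comparisons fair is to compare matching month windows. The sketch below uses made-up monthly totals, and `ytd_total` is a hypothetical helper, not a function from this project.

```python
from typing import Dict, Tuple


def ytd_total(monthly_paid: Dict[Tuple[int, int], float], year: int, through_month: int = 10) -> float:
    """Sum paid amounts for January through `through_month` of the given year."""
    return sum(
        paid
        for (y, m), paid in monthly_paid.items()
        if y == year and 1 <= m <= through_month
    )


# Hypothetical data: flat 100 per month in 2023, flat 110 per month for Jan-Oct 2024.
monthly_paid = {(2023, m): 100.0 for m in range(1, 13)}
monthly_paid.update({(2024, m): 110.0 for m in range(1, 11)})

# Naive comparison: 10 months of 2024 against 12 months of 2023 (looks like a decline).
naive_change = ytd_total(monthly_paid, 2024, 12) / ytd_total(monthly_paid, 2023, 12) - 1

# Like-for-like comparison: January-October in both years (shows the real increase).
like_for_like_change = ytd_total(monthly_paid, 2024) / ytd_total(monthly_paid, 2023) - 1
```

With these numbers the naive comparison shows a drop of roughly 8%, while the like-for-like comparison shows a rise of roughly 10%.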

COVID Distortions

The 2020-2021 period is heavily distorted by the pandemic. Utilization patterns during COVID were abnormal — deferred care, telehealth surges, the continuous enrollment provision, and emergency flexibilities all created artifacts in the data. Be cautious about drawing trend lines through this period.

A Note on Interpretation

Numbers don't speak for themselves. A rising spending trend could mean more people are getting needed care (good), or that prices are inflating without corresponding improvements (bad), or that sicker patients are entering the system (contextual). A declining trend could mean efficiency gains (good), or that people are losing access (bad), or that providers are leaving Medicaid (alarming).

Throughout this dashboard, we try to present the data clearly and note where multiple interpretations are plausible. We encourage readers to bring their own domain expertise and to treat these findings as starting points for investigation, not final answers.

Confidence Labels

We use lightweight confidence labels in narrative text:

- Descriptive, High confidence: direct aggregate from claims data with limited interpretation
- Descriptive, Medium confidence: direct aggregate with known attribution/mapping caveats
- Inference, Medium confidence: interpretation of trends that could have multiple plausible causes
- Inference, Low confidence: directional hypothesis needing external validation

**Data source:** HHS Medicaid Provider Spending dataset. Coverage: January 2018 through October 2024. Includes fee-for-service, managed care encounter, and CHIP claims data at the provider-procedure-month level.

Reproduce This Page

```bash
cd dashboard
export EVIDENCE_SOURCE__medicaid__token="<your_motherduck_token>"
export EVIDENCE_SOURCE__medicaid__database="medicaid"
npm run sources
npm run build
npm run preview
# then open http://localhost:3000/concerns
```