Methodology
How We Score Counties.
Every weight, every threshold, every data source. We publish our methodology transparently because the organizations using this data deserve to know exactly how the numbers are produced.
Methodology v1.2.0 · Last updated April 2026
Overview
What the platform measures and why
Banana Analytics scores every US county (N ≈ 3,222) across multiple dimensions of health system need. Each dimension is scored 0–100 using national percentile ranking, where 100 represents the highest risk or greatest burden.
When two or more dimensions simultaneously exceed the 70th percentile, this convergence is flagged as a compound signal — a geographic co-occurrence of environmental exposure, population health burden, and healthcare access deficit that indicates an outsized and potentially underserved health need.
The platform produces two types of composite scores:
- Opportunity Score: a weighted composite of Environmental Risk, Disease Burden, Provider Gap, and SDOH Stress used for broad county prioritization.
- Named Compound Signals: four domain-specific risk scores (Respiratory Burden, Wildfire Smoke Vulnerability, Heat Health Risk, Industrial Pollution Burden) that combine exposure, disease, and access indicators into clinically interpretable composites.
Scoring Dimensions
Four dimensions of county-level health need
Environmental Risk Score (0–100)
Composite of air quality, water contamination, soil hazards, and climate extremes. Higher scores indicate greater environmental burden.
| Sub-domain | Weight | Indicators |
|---|---|---|
| Air Quality | 35% | PM2.5 annual mean (35%), Ozone (25%), TRI releases (20%), NO2 (10%), SO2 (10%) |
| Water Quality | 25% | PFAS severity score |
| Soil & Chemical | 20% | Radon zone classification (50%), Pesticide use in kg (50%) |
| Climate | 20% | Days above 95°F (60%), Avg summer max temperature (40%) |
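The table above can be read as a two-level weighted average: indicators roll up into sub-domains, and sub-domains roll up into the dimension score. The following is a minimal sketch of that reading, assuming every indicator has already been converted to a 0–100 national percentile rank. The dictionary keys and the function name are illustrative, not the platform's field names, and whether the platform re-ranks the resulting composite is not stated here.

```python
# Sub-domain and indicator weights copied from the table above.
# Key names are hypothetical labels chosen for this sketch.
ENV_WEIGHTS = {
    "air": (0.35, {"pm25": 0.35, "ozone": 0.25, "tri": 0.20, "no2": 0.10, "so2": 0.10}),
    "water": (0.25, {"pfas": 1.00}),
    "soil": (0.20, {"radon": 0.50, "pesticide": 0.50}),
    "climate": (0.20, {"days_above_95f": 0.60, "summer_max_temp": 0.40}),
}


def environmental_risk(pct):
    """Two-level weighted average of indicator percentile ranks (0-100)."""
    total = 0.0
    for domain_weight, indicators in ENV_WEIGHTS.values():
        # Weighted average of the indicators inside one sub-domain.
        domain_score = sum(w * pct[name] for name, w in indicators.items())
        total += domain_weight * domain_score
    return total
```

Because the weights at each level sum to 1, a county at the 50th percentile on every indicator scores 50 overall.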
Disease Burden Score (0–100)
Weighted prevalence of chronic conditions organized by clinical service line. Higher scores indicate greater population health burden.
| Service Line | Weight | Conditions |
|---|---|---|
| Respiratory | 25% | Current asthma (50%), COPD (50%) |
| Oncology | 20% | Cancer prevalence |
| Cardiovascular | 20% | Coronary heart disease (50%), Stroke (50%) |
| Endocrine | 15% | Diabetes |
| Renal | 10% | Kidney disease |
| Behavioral Health | 10% | Depression (50%), Frequent mental distress (50%) |
Provider Gap Score (0–100, inverted)
Healthcare specialists per 100,000 population, percentile-ranked and inverted so that counties with fewer providers score higher (indicating worse access). The score uses a 50/50 population-weighted blend of within-county and neighboring-county provider density (using Census Bureau county adjacency files), reducing false positives where a county borders a major medical center. Named compound signals use specialty-specific provider counts: pulmonology for respiratory and wildfire signals, cardiology for heat health risk.
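One way to picture the adjacency blend and the inversion is sketched below. This is an interpretation, not the platform's code: it assumes "population-weighted" means that neighboring counties are pooled (total providers over total population), that the blend is applied to densities before ranking, and that ties share a mid-rank percentile. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def blended_density(own_providers, own_pop, neighbors):
    """50/50 blend of own and neighboring provider density per 100,000.

    neighbors: list of (providers, population) tuples for adjacent counties.
    """
    own = 100_000 * own_providers / own_pop
    n_prov = sum(p for p, _ in neighbors)
    n_pop = sum(pop for _, pop in neighbors)
    # Pooling providers and population weights each neighbor by its population.
    # A county with no neighbors falls back to its own density (assumption).
    neigh = 100_000 * n_prov / n_pop if n_pop else own
    return 0.5 * own + 0.5 * neigh


def provider_gap(density_by_county):
    """Percentile-rank densities, then invert so fewer providers score higher."""
    values = sorted(density_by_county.values())
    n = len(values)
    gap = {}
    for county, d in density_by_county.items():
        below = bisect_left(values, d)
        ties = bisect_right(values, d) - below
        pct = 100 * (below + 0.5 * ties) / n
        gap[county] = 100 - pct
    return gap
```

In this sketch a county with 10 specialists per 100,000 whose neighbors pool to 100 per 100,000 is treated as having a density of 55, which is what pulls it away from a false "no access" reading.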
SDOH Composite (0–100)
Social determinants of health composite spanning seven domains: food insecurity, housing instability, transportation barriers, utility difficulties, interpersonal safety, behavioral health access, and provider access. Derived from Census ACS, County Health Rankings, and CMS Geographic Variation data.
Composite Scoring
How dimensions become scores
Opportunity Score
Disease Burden receives the highest weight (35%) because chronic disease prevalence is the most direct indicator of population health need and service demand. Environmental Risk and Provider Gap each receive 25% as upstream exposure and access modifiers. SDOH Stress (15%) captures the social conditions that shape both access and outcomes. When SDOH data is unavailable for a county, the score falls back to the 3-dimension formula (Environmental Risk 30%, Disease Burden 40%, Provider Gap 30%).
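The weighting and the fallback can be sketched as a small function. This is an illustration under stated assumptions: the inputs are the four 0–100 dimension scores, the 3-dimension weights are read in the order Environmental Risk / Disease Burden / Provider Gap, and the dictionary keys and function name are hypothetical.

```python
WEIGHTS_4D = {"environmental": 0.25, "disease": 0.35, "provider_gap": 0.25, "sdoh": 0.15}
# Fallback when SDOH is missing; assumes 30/40/30 is ordered
# Environmental Risk / Disease Burden / Provider Gap.
WEIGHTS_3D = {"environmental": 0.30, "disease": 0.40, "provider_gap": 0.30}


def opportunity_score(scores):
    """Weighted composite of dimension scores (each 0-100)."""
    has_sdoh = scores.get("sdoh") is not None
    weights = WEIGHTS_4D if has_sdoh else WEIGHTS_3D
    return sum(w * scores[dim] for dim, w in weights.items())
```

For example, a county scoring 80 / 60 / 40 / 20 on Environmental Risk, Disease Burden, Provider Gap, and SDOH Stress gets 0.25·80 + 0.35·60 + 0.25·40 + 0.15·20 = 54. Without SDOH data the same county gets 0.30·80 + 0.40·60 + 0.30·40 = 60.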
Adding SDOH as a fourth dimension measurably improved the model's predictive validity: correlations with county-level mortality increased by 0.08–0.10 points across every major cause of death compared to the 3-dimension specification.
Compound Signal Detection
A compound signal is declared when two or more scoring dimensions simultaneously exceed the 70th percentile:
| Tier | Criteria | County Count |
|---|---|---|
| No signal | 0–1 of 4 dimensions elevated | 2,492 (77%) |
| Moderate compound signal | 2 of 4 dimensions elevated | 604 (19%) |
| Strong compound signal | 3 of 4 dimensions elevated | 126 (4%) |
| Extreme compound signal | All 4 dimensions elevated | 1 (<0.1%) |
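The tier table maps directly onto a count of elevated dimensions. A minimal sketch, assuming "exceed" means strictly greater than 70 and that exactly four dimension scores are supplied; the function and key names are illustrative.

```python
THRESHOLD = 70.0

# Number of elevated dimensions -> tier label from the table above.
TIERS = {
    0: "No signal",
    1: "No signal",
    2: "Moderate compound signal",
    3: "Strong compound signal",
    4: "Extreme compound signal",
}


def compound_tier(scores, threshold=THRESHOLD):
    """scores: dict of the four dimension scores (0-100)."""
    elevated = sum(1 for value in scores.values() if value > threshold)
    return TIERS[elevated]
```

A county at exactly 70 on a dimension is not counted as elevated in this sketch; the source does not say how boundary values are treated.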
Normalization
All indicator values are converted to national percentile ranks (0–100) before weighting. This normalizes disparate units (µg/m³, prevalence %, provider counts) to a common scale. Values are capped at the 1st and 99th percentiles (winsorization) before ranking to limit outlier influence.
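The two steps, winsorizing and then ranking, can be sketched with the standard library alone. The interpolation rule for the 1st and 99th percentile cut-offs and the mid-rank treatment of ties are assumptions made for this example; the source does not specify either.

```python
from bisect import bisect_left, bisect_right


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0-100) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def winsorize(values, lower=1, upper=99):
    """Cap values at the lower and upper percentile cut-offs."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [min(max(v, lo), hi) for v in values]


def percentile_ranks(values):
    """Mid-rank percentile: share of values below, plus half the ties, x100."""
    s = sorted(values)
    n = len(values)
    ranks = []
    for v in values:
        below = bisect_left(s, v)
        ties = bisect_right(s, v) - below
        ranks.append(100 * (below + 0.5 * ties) / n)
    return ranks
```

A typical call would be `percentile_ranks(winsorize(raw_values))`. Capping turns the most extreme values into ties at the cut-off, so they share one percentile rank rather than each occupying its own position at the tail.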
Missing data handling
Counties missing an indicator are scored on available data only. Weights are renormalized across the indicators that are present, so missing data does not penalize the county. Each score carries a coverage and confidence classification:
| Confidence | Coverage | Interpretation |
|---|---|---|
| High | ≥90% of components present | Score is well-supported |
| Medium | 50–89% of components present | Score is reasonable but less certain |
| Low | <50% of components present | Interpret with caution |
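Renormalization and the confidence bands can be sketched together. This example assumes coverage is the share of components present by count (not by weight) and that a missing indicator is represented as `None`; the function name is hypothetical.

```python
def score_with_coverage(values, weights):
    """Weighted average over present indicators, with coverage and confidence.

    values:  indicator -> percentile rank (0-100) or None when missing
    weights: indicator -> nominal weight
    """
    present = {k: w for k, w in weights.items() if values.get(k) is not None}
    coverage = len(present) / len(weights)
    if not present:
        return None, 0.0, "Low"
    # Dividing by the sum of present weights is the renormalization step.
    total_weight = sum(present.values())
    score = sum(values[k] * w for k, w in present.items()) / total_weight
    if coverage >= 0.90:
        confidence = "High"
    elif coverage >= 0.50:
        confidence = "Medium"
    else:
        confidence = "Low"
    return score, coverage, confidence
```

With the Air Quality weights and SO2 missing, the remaining four weights sum to 0.90, so each is effectively scaled up by 1/0.90 and the county is classified Medium at 80% coverage.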
Named Compound Signals
Four clinically interpretable risk composites
Each named signal combines an environmental exposure, a disease prevalence blend, and a provider access deficit into a single score. Multi-condition disease components use a 60/40 dominant/secondary blend: the higher-percentile condition receives 60% weight and the lower receives 40%, ensuring that counties with both conditions elevated score higher than those with only one.
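The 60/40 blend is small enough to show in full. A minimal sketch, assuming both inputs are 0–100 percentile ranks; the function name is illustrative.

```python
def dominant_secondary_blend(a, b, dominant=0.6):
    """Weight the higher percentile at 60% and the lower at 40%."""
    hi, lo = max(a, b), min(a, b)
    return dominant * hi + (1 - dominant) * lo
```

With asthma at the 90th percentile and COPD at the 50th, the blend is 0.6·90 + 0.4·50 = 74. With both at the 90th it is 90. A plain `max()` would return 90 in both cases, which is the distinction the blend is meant to preserve.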
Respiratory Burden
| Component | Weight | Source |
|---|---|---|
| PM2.5 annual mean | 40% | EPA AQS / EJSCREEN |
| Asthma + COPD blend | 30% | CDC PLACES |
| Pulmonology access deficit | 30% | NPPES (inverted) |
Wildfire Smoke Vulnerability
| Component | Weight | Source |
|---|---|---|
| Active fires within 200km | 35% | NIFC/WFIGS |
| 30-day max AQI | 20% | EPA AQS daily |
| Asthma + COPD blend | 25% | CDC PLACES |
| Pulmonology access deficit | 20% | NPPES (inverted) |
Heat Health Risk
| Component | Weight | Source |
|---|---|---|
| Avg summer max temperature | 40% | NOAA ACIS |
| CHD + Diabetes blend | 30% | CDC PLACES |
| Cardiology access deficit | 30% | NPPES (inverted) |
Industrial Pollution Burden
| Component | Weight | Source |
|---|---|---|
| TRI facility count | 35% | EPA Envirofacts |
| PFAS contamination | 25% | EPA UCMR5/ECHO |
| Pesticide usage (kg) | 20% | USGS PNSP |
| Total provider access deficit | 20% | NPPES (inverted) |
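The four weight tables can be expressed as one configuration plus a single scoring function. The sketch below assumes each component has already been reduced to a 0–100 percentile (with disease blends and inverted access deficits computed upstream). The key names are hypothetical labels for the table rows, not platform identifiers.

```python
# Component weights copied from the four tables above.
NAMED_SIGNALS = {
    "respiratory_burden": {
        "pm25": 0.40, "asthma_copd_blend": 0.30, "pulmonology_deficit": 0.30,
    },
    "wildfire_smoke_vulnerability": {
        "active_fires_200km": 0.35, "max_aqi_30d": 0.20,
        "asthma_copd_blend": 0.25, "pulmonology_deficit": 0.20,
    },
    "heat_health_risk": {
        "summer_max_temp": 0.40, "chd_diabetes_blend": 0.30, "cardiology_deficit": 0.30,
    },
    "industrial_pollution_burden": {
        "tri_facility_count": 0.35, "pfas": 0.25,
        "pesticide_kg": 0.20, "provider_deficit": 0.20,
    },
}


def named_signal_score(signal, components):
    """Weighted sum of a signal's component percentiles (0-100)."""
    weights = NAMED_SIGNALS[signal]
    return sum(w * components[name] for name, w in weights.items())
```

For Respiratory Burden, a county at the 80th percentile for PM2.5, with a disease blend of 74 and an access deficit of 60, scores 0.40·80 + 0.30·74 + 0.30·60 = 72.2.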
Data Sources & Freshness
14 federal data foundations
All data sources are publicly available US federal datasets. The platform refreshes data daily via automated Airflow pipelines. The “Typical Lag” column indicates how current the source data typically is at the time of ingestion. See the full data sources page for descriptions and current vintage dates.
| Source | Indicators | Refresh | Typical Lag |
|---|---|---|---|
| EPA AQS | PM2.5, Ozone, NO2, SO2 | Monthly | 6–18 months |
| EPA EJSCREEN | Modeled PM2.5, Diesel PM, Traffic proximity, Superfund proximity | Quarterly | 1 year |
| EPA TRI | Toxic releases, facility counts | Annual | 18 months |
| EPA UCMR5/ECHO | PFAS detections | Quarterly | 3–6 months |
| USGS PNSP | Pesticide use by county | Annual | 2–3 years |
| EPA Radon Zones | Radon zone classification | Static (1993) | N/A |
| NOAA ACIS | Days above 95°F, summer max temp | Monthly | Current year |
| CDC PLACES | Disease prevalence (9 conditions) | Monthly | 1–2 years |
| CDC WONDER | Cause-specific mortality (8 causes) | Monthly | 1–2 years |
| NCI State Cancer Profiles | Cancer incidence by site | Monthly | 2–3 years |
| CHR / NVSS | Low birth weight, infant mortality | Monthly | 1–2 years |
| CMS NPPES | Provider supply by specialty | Monthly | Current month |
| Census ACS 5-Year | Demographics, SDOH indicators | Quarterly | 1 year |
| CMS Geographic Variation | Medicare spending by category | Monthly | 1–2 years |
Sensitivity & Robustness
How stable are the results?
Weight stability
We test alternative weight vectors across reasonable perturbation ranges for every set of weights in the system:
- Environmental domain weights: Equal (25/25/25/25), air-heavy (45/20/15/20), climate-heavy (25/20/15/40), water-heavy (25/35/20/20)
- Opportunity score weights: Equal across all dimensions, disease-dominant (50%), environment-dominant (40%), and ±10 percentage point perturbations on each dimension
- Named signal weights: ±10 percentage point perturbations on the two highest-weighted components
For each alternative, we compute the Spearman rank correlation with the baseline, the number of counties that change compound signal tier, and the stability of the top-25 and bottom-25 counties. The sensitivity analysis scripts are versioned alongside the scoring pipeline for reproducibility.
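Two of these stability measures, rank correlation and top-k stability, can be sketched without external libraries. This is not the versioned sensitivity script. The function names are hypothetical, ties receive average ranks, and top-k stability is expressed as the share of baseline top-k counties that remain in the top k.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def top_k_overlap(baseline, alternative, k=25):
    """Share of the baseline top-k (by score) still in the alternative top-k."""
    def top(scores):
        return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])
    return len(top(baseline) & top(alternative)) / k
```

Here `baseline` and `alternative` are lists of county scores in the same county order, one computed with the published weights and one with a perturbed weight vector.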
Threshold selection
The compound signal threshold (70th percentile) was evaluated by sweeping from the 50th to the 90th percentile in 5-point increments. The sweep measures how many counties qualify and what percentage of the US population they represent at each threshold.
The 70th percentile was selected to balance signal prevalence (enough qualifying counties to be useful for planning) with actionability (few enough that the signal remains discriminating). Below the 65th percentile, compound signals become too common to inform prioritization; above the 80th, they become too rare to support service line planning.
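The sweep described above can be sketched as a loop over candidate thresholds. The input layout (a list of population and dimension-score pairs) and the function name are assumptions made for illustration.

```python
def threshold_sweep(counties, start=50, stop=90, step=5, min_elevated=2):
    """Count qualifying counties and their population share at each threshold.

    counties: list of (population, [dimension scores]) tuples.
    Returns a list of (threshold, county_count, population_pct) rows.
    """
    total_pop = sum(pop for pop, _ in counties)
    rows = []
    for threshold in range(start, stop + 1, step):
        qualifying = [
            pop for pop, scores in counties
            if sum(s > threshold for s in scores) >= min_elevated
        ]
        rows.append((threshold, len(qualifying), 100 * sum(qualifying) / total_pop))
    return rows
```

Reading the rows from the 50th to the 90th percentile shows how quickly the qualifying set shrinks, which is the trade-off between prevalence and discrimination that the threshold choice is balancing.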
Limitations of weight selection
Dimension and indicator weights reflect structured expert judgment informed by epidemiological literature on environmental health linkages and health system service line economics. They are not empirically optimized against outcome data. We publish our weights transparently and provide sensitivity analysis tooling so users can assess robustness and explore alternative specifications.
Validation
Do the signals predict real-world outcomes?
The validation framework tests whether environmental and surveillance signals correlate with real-world health system utilization. We evaluated 32 signal-outcome pairs across 7 domains (ILI surveillance, wastewater COVID, temperature, humidity, heat index, air quality, severe storms) against outcomes including NHSN hospital admissions, NSSP emergency department visits, and bed occupancy rates.
Key findings
ILI surveillance correlations
r = 0.81–0.91
ILI surveillance signals strongly predict influenza hospitalizations (r = 0.81) and flu ED visits (r = 0.91) across all 50+ states. 100% of states show statistically significant associations.
Wastewater COVID signals
r = 0.70–0.79
COVID wastewater percentiles predict COVID ED visits (r = 0.79) and hospitalizations (r = 0.70) with 100% state significance. Granger causality confirmed in 58–70% of states.
Temperature and respiratory ED
r = 0.63
Weekly max temperature correlates with combined respiratory ED visits across all 48 reporting states. Granger causality confirmed in 73% of states.
Regional consistency
10 / 10 regions
All ten NOAA climate regions show significant ILI-to-hospitalization correlations, with mean |r| ranging from 0.47 (Alaska) to 0.89 (Northwest).
Opportunity Score vs. county-level mortality
The 4-dimension Opportunity Score correlates with mortality from CDC WONDER across every major cause of death (all p < 0.001):
| Mortality Outcome | Spearman r |
|---|---|
| All-cause mortality | +0.598 |
| Heart disease mortality | +0.546 |
| Chronic lower respiratory disease | +0.538 |
| Cancer mortality | +0.430 |
Environmental signal domain performance
| Signal Domain | Pairs Tested | Avg Composite Score |
|---|---|---|
| ILI Surveillance | 4 | 0.81 |
| Wastewater COVID | 2 | 0.81 |
| Humidity | 3 | 0.51 |
| Temperature | 7 | 0.31 |
| Heat Index | 6 | 0.28 |
| Air Quality | 6 | 0.16 |
| Severe Storms | 4 | 0.13 |
Composite scores combine correlation strength, statistical significance, Granger causality, and state-level consistency into a single 0–1 metric. Scores above 0.5 indicate strong, consistent predictive relationships.
Interpretation
The strongest validation comes from ILI and wastewater surveillance, where signals predict hospital utilization with high consistency across regions. Environmental exposure signals (temperature, humidity) show moderate but significant associations with respiratory ED visits. Air quality and storm signals show weaker correlations, consistent with the ecological nature of the analysis and the diffuse exposure pathways involved.
These correlations are observed at the state level over time and do not establish individual-level causal pathways. They demonstrate that the signal domains tracked by the platform correspond to measurable patterns in health system utilization.
Limitations & Interpretation Guidance
What the scores can and cannot tell you
Ecological fallacy
All scores represent county-level aggregate patterns. Within-county variation may be substantial. A county with a high respiratory burden score may have asthma concentrated in specific neighborhoods near industrial sources, while other areas of the county are unaffected. Scores should not be interpreted as individual-level risk assessments.
Cross-sectional design
Compound signals identify geographic convergence of risk factors at a point in time. They do not establish temporal sequence or causation. A county with high environmental risk and high disease burden may reflect (a) environmental exposure contributing to disease, (b) population migration patterns that co-locate vulnerable groups with environmental hazards, or (c) shared upstream determinants affecting both.
Data lag
The platform integrates data sources with varying recency. CDC PLACES health estimates lag 1–2 years. EPA air quality data may lag 6–18 months. Provider counts from NPPES reflect registration, not active practice. Scores represent the best available composite picture, not a real-time snapshot.
NPPES limitations
Provider counts are derived from the National Plan and Provider Enumeration System, which records where providers register, not necessarily where they practice. To mitigate this, the Provider Gap score uses a 50/50 population-weighted blend of within-county and neighboring-county provider density. This adjacency adjustment reduces false positives where a county borders a major medical center, but does not fully resolve registration-vs-practice location discrepancies.
Weight subjectivity
Dimension and indicator weights reflect structured expert judgment informed by epidemiological literature and health system service line economics. They are not empirically derived from outcome data. Sensitivity analysis demonstrates that key findings are robust across reasonable weight perturbations.
Geographic resolution
The platform currently operates at county-level resolution. Sub-county patterns (ZIP code, census tract) may differ substantially from county-level scores. County-level analysis is most appropriate for regional planning and needs assessment; facility-level decisions require finer-grained analysis.
Versioning & Changelog
Methodology versions
The methodology is locked within a major version. Formula changes increment the minor version. Structural changes (new dimensions, new signals, validation-informed recalibration) increment the major version. Every scored output row is stamped with the methodology version that produced it.
- Opportunity Score expanded to 4 dimensions: added SDOH Stress (15%) alongside Environmental Risk (25%), Disease Burden (35%), Provider Gap (25%)
- Provider Gap now uses 50/50 population-weighted adjacency adjustment (within-county + neighboring-county blend)
- Compound signal detection expanded from 3 to 4 dimensions; tiers updated accordingly
- Validation correlations improved by 0.08–0.10 across all mortality outcomes with the 4-dimension model
- Cancer incidence data added (NCI State Cancer Profiles, 2,926 counties, 5 sites)
- Full EJSCREEN environmental justice proximity indicators (9 new fields)
- Demographics with race/ethnicity composition and pre-1978 housing lead risk proxy
- Historical trends for 8 CDC PLACES measures (2018–2023)
- Replaced max() with dominant/secondary blend (60/40) for multi-condition disease components in named signals
- Added specialty-specific provider gaps: pulmonology for respiratory and wildfire signals, cardiology for heat health risk
- Added components_present and coverage_pct metadata to all dimension scores
- Published sensitivity analysis across weight and threshold parameters
- Published validation against NHSN, NSSP, and CMS utilization data
- Added Limitations & Interpretation Guidance section
- Updated methodology text to use associational rather than causal language
- Initial release with 4 named compound signals
- max() aggregation for multi-condition disease components
- Total provider density (not specialty-specific)
Citation
How to cite this methodology
If you reference the methodology or its outputs in published work, community health needs assessments, or grant applications, please cite the version you used. The methodology version is displayed on every report generated by the platform.