Banana Analytics

Methodology

How We Score Counties.

Every weight, every threshold, every data source. We publish our methodology transparently because the organizations using this data deserve to know exactly how the numbers are produced.

Methodology v1.2.0 · Last updated April 2026

Overview

What the platform measures and why

Banana Analytics scores every US county (N ≈ 3,222) across multiple dimensions of health system need. Each dimension is scored 0–100 using national percentile ranking, where 100 represents the highest risk or greatest burden.

When two or more dimensions simultaneously exceed the 70th percentile, this convergence is flagged as a compound signal — a geographic co-occurrence of environmental exposure, population health burden, and healthcare access deficit that indicates an outsized and potentially underserved health need.

The platform produces two types of composite scores:

  1. Opportunity Score — a weighted composite of Environmental Risk, Disease Burden, and Provider Gap used for broad county prioritization.
  2. Named Compound Signals — four domain-specific risk scores (Respiratory Burden, Wildfire Smoke Vulnerability, Heat Health Risk, Industrial Pollution Burden) that combine exposure, disease, and access indicators into clinically interpretable composites.

Scoring Dimensions

Four dimensions of county-level health need

Environmental Risk Score (0–100)

Composite of air quality, water contamination, soil hazards, and climate extremes. Higher scores indicate greater environmental burden.

| Sub-domain | Weight | Indicators |
| --- | --- | --- |
| Air Quality | 35% | PM2.5 annual mean (35%), Ozone (25%), TRI releases (20%), NO2 (10%), SO2 (10%) |
| Water Quality | 25% | PFAS severity score |
| Soil & Chemical | 20% | Radon zone classification (50%), Pesticide use in kg (50%) |
| Climate | 20% | Days above 95°F (60%), Avg summer max temperature (40%) |
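
The two-level weighting in the table above can be sketched as follows. This is a minimal illustration, not the production pipeline: it assumes every indicator has already been converted to a national percentile rank, that sub-domain scores are plain weighted averages, and that the result is not re-ranked afterwards. The indicator keys (`pm25`, `pfas`, and so on) are hypothetical names chosen for the example.

```python
# Weights transcribed from the Environmental Risk table.
# Structure: sub-domain -> (sub-domain weight, {indicator: within-domain weight}).
# Indicator keys are hypothetical names used only in this sketch.
ENV_WEIGHTS = {
    "air": (0.35, {"pm25": 0.35, "ozone": 0.25, "tri": 0.20, "no2": 0.10, "so2": 0.10}),
    "water": (0.25, {"pfas": 1.0}),
    "soil": (0.20, {"radon": 0.5, "pesticide_kg": 0.5}),
    "climate": (0.20, {"days_above_95f": 0.6, "summer_max_temp": 0.4}),
}


def environmental_risk(pct):
    """Combine indicator percentile ranks (0-100) into one 0-100 score.

    `pct` maps each indicator key to its percentile rank. Assumes no
    indicator is missing.
    """
    total = 0.0
    for domain_weight, indicators in ENV_WEIGHTS.values():
        # Weighted average of the indicators inside this sub-domain.
        sub_score = sum(w * pct[name] for name, w in indicators.items())
        total += domain_weight * sub_score
    return total
```

Under these assumptions, a county at the 100th percentile on every air indicator and at 0 everywhere else would score 35, which is the Air Quality sub-domain weight.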

Disease Burden Score (0–100)

Weighted prevalence of chronic conditions organized by clinical service line. Higher scores indicate greater population health burden.

| Service Line | Weight | Conditions |
| --- | --- | --- |
| Respiratory | 25% | Current asthma (50%), COPD (50%) |
| Oncology | 20% | Cancer prevalence |
| Cardiovascular | 20% | Coronary heart disease (50%), Stroke (50%) |
| Endocrine | 15% | Diabetes |
| Renal | 10% | Kidney disease |
| Behavioral Health | 10% | Depression (50%), Frequent mental distress (50%) |

Provider Gap Score (0–100, inverted)

Healthcare specialists per 100,000 population, percentile-ranked and inverted so that counties with fewer providers score higher (indicating worse access). The score uses a 50/50 population-weighted blend of within-county and neighboring-county provider density (using Census Bureau county adjacency files), reducing false positives where a county borders a major medical center. Named compound signals use specialty-specific provider counts: pulmonology for respiratory and wildfire signals, cardiology for heat health risk.
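
One possible reading of the adjacency blend is sketched below. The text does not spell out the exact formula, so this is an assumption: the neighboring-county density is computed by pooling providers and population across all adjacent counties (which population-weights the neighbors), and it is then averaged 50/50 with the county's own density. The function name and the input dictionaries are hypothetical.

```python
def blended_provider_density(county, providers, population, neighbors):
    """Providers per 100,000, blended 50/50 with adjacent counties.

    providers / population: dicts keyed by county identifier.
    neighbors: dict mapping a county to a list of adjacent counties
    (for example, built from the Census Bureau county adjacency file).
    """
    own = providers[county] / population[county] * 100_000
    adjacent = neighbors.get(county, [])
    adjacent_pop = sum(population[n] for n in adjacent)
    if adjacent_pop == 0:
        # No usable neighbors (e.g. an island county): fall back to own density.
        return own
    # Pooling providers and population weights each neighbor by its population.
    neighbor_density = sum(providers[n] for n in adjacent) / adjacent_pop * 100_000
    return 0.5 * own + 0.5 * neighbor_density
```

The blended density would then be percentile-ranked nationally and inverted (for instance `100 - percentile`) so that lower density yields a higher Provider Gap score.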

SDOH Composite (0–100)

Social determinants of health composite spanning seven domains: food insecurity, housing instability, transportation barriers, utility difficulties, interpersonal safety, behavioral health access, and provider access. Derived from Census ACS, County Health Rankings, and CMS Geographic Variation data.

Composite Scoring

How dimensions become scores

Opportunity Score

Opportunity = 0.25 × Environmental Risk + 0.35 × Disease Burden + 0.25 × Provider Gap + 0.15 × SDOH Stress

Disease Burden receives the highest weight (35%) because chronic disease prevalence is the most direct indicator of population health need and service demand. Environmental Risk and Provider Gap each receive 25% as upstream exposure and access modifiers. SDOH Stress (15%) captures the social conditions that shape both access and outcomes. When SDOH data is unavailable for a county, the score falls back to the 3-dimension formula (30/40/30).
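
As a minimal sketch, the 4-dimension formula and its 3-dimension fallback could be written as follows. The function name and the use of `None` to mark missing SDOH data are assumptions made for illustration.

```python
def opportunity_score(env_risk, disease_burden, provider_gap, sdoh_stress=None):
    """Weighted composite of dimension scores, each on a 0-100 scale.

    When SDOH data is unavailable (passed as None), use the
    3-dimension fallback weights of 30/40/30.
    """
    if sdoh_stress is None:
        return 0.30 * env_risk + 0.40 * disease_burden + 0.30 * provider_gap
    return (
        0.25 * env_risk
        + 0.35 * disease_burden
        + 0.25 * provider_gap
        + 0.15 * sdoh_stress
    )
```

For a county scoring 80, 60, 40, and 20 on the four dimensions, this gives 20 + 21 + 10 + 3 = 54. Without SDOH data the same county gets 24 + 24 + 12 = 60.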

Adding SDOH as a fourth dimension measurably improved the model's predictive validity: correlations with county-level mortality increased by 0.08–0.10 points across every major cause of death compared to the 3-dimension specification.

Compound Signal Detection

A compound signal is declared when two or more scoring dimensions simultaneously exceed the 70th percentile:

| Tier | Criteria | County Count |
| --- | --- | --- |
| No signal | 0–1 of 4 dimensions elevated | 2,492 (77%) |
| Moderate compound signal | 2 of 4 dimensions elevated | 604 (19%) |
| Strong compound signal | 3 of 4 dimensions elevated | 126 (4%) |
| Extreme compound signal | All 4 dimensions elevated | 1 (<0.1%) |
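
The tier assignment can be illustrated with a short sketch. It assumes that "exceed" means strictly greater than the threshold and that a dimension with no score never counts as elevated; both are assumptions, as are the function name and dimension keys.

```python
TIER_BY_COUNT = {
    0: "No signal",
    1: "No signal",
    2: "Moderate compound signal",
    3: "Strong compound signal",
    4: "Extreme compound signal",
}


def compound_tier(dimension_scores, threshold=70.0):
    """Return the tier label for one county.

    dimension_scores: dict with the four dimension scores (0-100);
    a value of None marks a dimension with no score.
    """
    elevated = sum(
        1 for score in dimension_scores.values()
        if score is not None and score > threshold
    )
    return TIER_BY_COUNT[elevated]
```

For example, a county with scores of 85, 72, 40, and 65 has two elevated dimensions and would be labeled a moderate compound signal.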

Normalization

All indicator values are converted to national percentile ranks (0–100) before weighting. This normalizes disparate units (µg/m³, prevalence %, provider counts) to a common scale. Values are capped at the 1st and 99th percentiles (winsorization) before ranking to limit outlier influence.
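
A minimal sketch of this normalization step is shown below. The exact percentile definition is not stated above, so the choices here are assumptions: cutoffs are taken by nearest rank, ties share their average rank, and ranks are rescaled so the lowest value maps to 0 and the highest to 100.

```python
def winsorize(values, lower_pct=1, upper_pct=99):
    """Cap values at the given lower and upper percentiles (nearest rank)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[round(lower_pct * (n - 1) / 100)]
    high = ordered[round(upper_pct * (n - 1) / 100)]
    return [min(max(v, low), high) for v in values]


def percentile_ranks(values):
    """Rank values on a 0-100 scale; tied values share their average rank."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to rank")
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2  # zero-based average rank
        i = j + 1
    return [100.0 * r / (n - 1) for r in ranks]


def normalize(values):
    """Winsorize, then convert to percentile ranks."""
    return percentile_ranks(winsorize(values))
```

One consequence of capping before ranking is that all values beyond a cutoff become tied and therefore receive the same percentile rank.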

Missing data handling

Counties missing an indicator are scored on the available data only. The weights of the remaining components are renormalized to sum to 100%, so a missing component is not treated as a zero and does not penalize the county. Each score carries a coverage and confidence classification:

| Confidence | Coverage | Interpretation |
| --- | --- | --- |
| High | ≥90% of components present | Score is well-supported |
| Medium | 50–89% of components present | Score is reasonable but less certain |
| Low | <50% of components present | Interpret with caution |
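
The renormalization and confidence classification could look like the sketch below. It assumes coverage is the share of components present by count (not by weight) and that `None` marks a missing component. The `components_present` and `coverage_pct` names follow the metadata fields listed in the changelog; everything else is hypothetical.

```python
def score_with_coverage(values, weights):
    """Weighted score over available components, with coverage metadata.

    values: dict of component -> percentile rank, or None if missing.
    weights: dict of component -> weight (summing to 1 when complete).
    """
    present = {k: values.get(k) for k in weights if values.get(k) is not None}
    coverage_pct = 100.0 * len(present) / len(weights)
    if present:
        # Divide by the weight actually present so missing components
        # are not counted as zeros.
        weight_sum = sum(weights[k] for k in present)
        score = sum(weights[k] * v for k, v in present.items()) / weight_sum
    else:
        score = None
    if coverage_pct >= 90:
        confidence = "High"
    elif coverage_pct >= 50:
        confidence = "Medium"
    else:
        confidence = "Low"
    return {
        "score": score,
        "components_present": len(present),
        "coverage_pct": coverage_pct,
        "confidence": confidence,
    }
```

With weights of 0.5, 0.3, and 0.2 and the middle component missing, values of 80 and 40 give (0.5 × 80 + 0.2 × 40) / 0.7 ≈ 68.6 at Medium confidence.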

Named Compound Signals

Four clinically interpretable risk composites

Each named signal combines an environmental exposure, a disease prevalence blend, and a provider access deficit into a single score. Multi-condition disease components use a 60/40 dominant/secondary blend: the higher-percentile condition receives 60% weight and the lower receives 40%, ensuring that counties with both conditions elevated score higher than those with only one.
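
The 60/40 blend is small enough to show directly. The sketch below compares it with the `max()` aggregation it replaced; the function name is hypothetical.

```python
def dominant_secondary_blend(pct_a, pct_b):
    """60% weight on the higher percentile, 40% on the lower."""
    return 0.6 * max(pct_a, pct_b) + 0.4 * min(pct_a, pct_b)


# One condition elevated versus both elevated (percentile ranks).
one_elevated = dominant_secondary_blend(90, 50)   # 0.6*90 + 0.4*50
both_elevated = dominant_secondary_blend(90, 90)  # 0.6*90 + 0.4*90
```

Here the county with only one elevated condition scores 74 and the county with both scores 90, whereas `max()` would give 90 in both cases.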

Respiratory Burden

| Component | Weight | Source |
| --- | --- | --- |
| PM2.5 annual mean | 40% | EPA AQS / EJSCREEN |
| Asthma + COPD blend | 30% | CDC PLACES |
| Pulmonology access deficit | 30% | NPPES (inverted) |

Wildfire Smoke Vulnerability

| Component | Weight | Source |
| --- | --- | --- |
| Active fires within 200km | 35% | NIFC/WFIGS |
| 30-day max AQI | 20% | EPA AQS daily |
| Asthma + COPD blend | 25% | CDC PLACES |
| Pulmonology access deficit | 20% | NPPES (inverted) |

Heat Health Risk

| Component | Weight | Source |
| --- | --- | --- |
| Avg summer max temperature | 40% | NOAA ACIS |
| CHD + Diabetes blend | 30% | CDC PLACES |
| Cardiology access deficit | 30% | NPPES (inverted) |

Industrial Pollution Burden

| Component | Weight | Source |
| --- | --- | --- |
| TRI facility count | 35% | EPA Envirofacts |
| PFAS contamination | 25% | EPA UCMR5/ECHO |
| Pesticide usage (kg) | 20% | USGS PNSP |
| Total provider access deficit | 20% | NPPES (inverted) |
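
The four component tables can be collected into a single lookup, as in the sketch below. It assumes each component has already been percentile-ranked (with provider access inverted and disease blends pre-computed) and that the signal is a plain weighted sum. The dictionary keys are hypothetical names.

```python
# Weights transcribed from the four named-signal tables.
SIGNAL_WEIGHTS = {
    "respiratory_burden": {
        "pm25": 0.40, "asthma_copd_blend": 0.30, "pulmonology_deficit": 0.30,
    },
    "wildfire_smoke_vulnerability": {
        "active_fires_200km": 0.35, "max_aqi_30d": 0.20,
        "asthma_copd_blend": 0.25, "pulmonology_deficit": 0.20,
    },
    "heat_health_risk": {
        "summer_max_temp": 0.40, "chd_diabetes_blend": 0.30,
        "cardiology_deficit": 0.30,
    },
    "industrial_pollution_burden": {
        "tri_facility_count": 0.35, "pfas": 0.25,
        "pesticide_kg": 0.20, "provider_deficit": 0.20,
    },
}


def named_signal_score(signal, components):
    """Weighted sum of component percentile ranks for one named signal."""
    weights = SIGNAL_WEIGHTS[signal]
    return sum(w * components[name] for name, w in weights.items())
```

For instance, a county at the 90th percentile for PM2.5, with a disease blend of 74 and a pulmonology deficit of 60, would have a Respiratory Burden of 36 + 22.2 + 18 = 76.2 under these assumptions.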

Data Sources & Freshness

14 federal data foundations

All data sources are publicly available US federal datasets. The platform refreshes data daily via automated Airflow pipelines. The “Typical Lag” column indicates how current the source data typically is at the time of ingestion. See the full data sources page for descriptions and current vintage dates.

| Source | Indicators | Refresh | Typical Lag |
| --- | --- | --- | --- |
| EPA AQS | PM2.5, Ozone, NO2, SO2 | Monthly | 6–18 months |
| EPA EJSCREEN | Modeled PM2.5, Diesel PM, Traffic proximity, Superfund proximity | Quarterly | 1 year |
| EPA TRI | Toxic releases, facility counts | Annual | 18 months |
| EPA UCMR5/ECHO | PFAS detections | Quarterly | 3–6 months |
| USGS PNSP | Pesticide use by county | Annual | 2–3 years |
| EPA Radon Zones | Radon zone classification | Static (1993) | N/A |
| NOAA ACIS | Days above 95°F, summer max temp | Monthly | Current year |
| CDC PLACES | Disease prevalence (9 conditions) | Monthly | 1–2 years |
| CDC WONDER | Cause-specific mortality (8 causes) | Monthly | 1–2 years |
| NCI State Cancer Profiles | Cancer incidence by site | Monthly | 2–3 years |
| CHR / NVSS | Low birth weight, infant mortality | Monthly | 1–2 years |
| CMS NPPES | Provider supply by specialty | Monthly | Current month |
| Census ACS 5-Year | Demographics, SDOH indicators | Quarterly | 1 year |
| CMS Geographic Variation | Medicare spending by category | Monthly | 1–2 years |

Sensitivity & Robustness

How stable are the results?

Weight stability

We test alternative weight vectors across reasonable perturbation ranges for every set of weights in the system:

  • Environmental domain weights: Equal (25/25/25/25), air-heavy (45/20/15/20), climate-heavy (25/20/15/40), water-heavy (25/35/20/20)
  • Opportunity score weights: Equal across all dimensions, disease-dominant (50%), environment-dominant (40%), and ±10 percentage point perturbations on each dimension
  • Named signal weights: ±10 percentage point perturbations on the two highest-weighted components

For each alternative, we compute Spearman rank correlation with the baseline, count of counties that change compound signal tier, and stability of the top-25 and bottom-25 counties. The sensitivity analysis scripts are versioned alongside the scoring pipeline for reproducibility.
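
The three stability metrics named above can be sketched with the standard library alone. This is an illustration, not the versioned analysis scripts: the function names are hypothetical, and Spearman correlation is computed as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def tier_changes(baseline_tiers, alternative_tiers):
    """Count counties whose compound signal tier differs between runs."""
    return sum(
        1 for county, tier in baseline_tiers.items()
        if alternative_tiers[county] != tier
    )


def top_k_overlap(baseline, alternative, k=25):
    """Share of the baseline top-k counties that remain in the top-k."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    return len(top(baseline) & top(alternative)) / k
```

Bottom-25 stability could be measured the same way by sorting in ascending order.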

Threshold selection

The compound signal threshold (70th percentile) was evaluated by sweeping from the 50th to the 90th percentile in 5-point increments. The sweep measures how many counties qualify and what percentage of the US population they represent at each threshold.

The 70th percentile was selected to balance signal prevalence (enough qualifying counties to be useful for planning) with actionability (few enough that the signal remains discriminating). Below the 65th percentile, compound signals become too common to inform prioritization; above the 80th, they become too rare to support service line planning.
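
A threshold sweep of this kind might be written as follows. The input layout (a list of dicts with `scores` and `population` keys) and the function name are assumptions for the example, and a county qualifies when at least two dimension scores are strictly above the threshold.

```python
def sweep_thresholds(counties, start=50, stop=90, step=5):
    """For each threshold, count qualifying counties and their population share.

    counties: list of dicts with "scores" (dimension scores, 0-100)
    and "population". Returns (threshold, county_count, population_pct) tuples.
    """
    total_population = sum(c["population"] for c in counties)
    rows = []
    for threshold in range(start, stop + 1, step):
        flagged = [
            c for c in counties
            if sum(1 for s in c["scores"] if s > threshold) >= 2
        ]
        flagged_population = sum(c["population"] for c in flagged)
        rows.append(
            (threshold, len(flagged), 100 * flagged_population / total_population)
        )
    return rows
```

Plotting county count and population share against the threshold shows where signals become too common or too rare to be useful.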

Limitations of weight selection

Dimension and indicator weights reflect structured expert judgment informed by epidemiological literature on environmental health linkages and health system service line economics. They are not empirically optimized against outcome data. We publish our weights transparently and provide sensitivity analysis tooling so users can assess robustness and explore alternative specifications.

Validation

Do the signals predict real-world outcomes?

The validation framework tests whether environmental and surveillance signals correlate with real-world health system utilization. We evaluated 32 signal-outcome pairs across 7 domains (ILI surveillance, wastewater COVID, temperature, humidity, heat index, air quality, severe storms) against outcomes including NHSN hospital admissions, NSSP emergency department visits, and bed occupancy rates.

Key findings

ILI surveillance correlations

r = 0.81–0.91

ILI surveillance signals strongly predict influenza hospitalizations (r = 0.81) and flu ED visits (r = 0.91) across all 50+ states. 100% of states show statistically significant associations.

Wastewater COVID signals

r = 0.70–0.79

COVID wastewater percentiles predict COVID ED visits (r = 0.79) and hospitalizations (r = 0.70) with 100% state significance. Granger causality confirmed in 58–70% of states.

Temperature and respiratory ED

r = 0.63

Weekly max temperature correlates with combined respiratory ED visits across all 48 reporting states. Granger causality confirmed in 73% of states.

Regional consistency

10 / 10 regions

All ten NOAA climate regions show significant ILI-to-hospitalization correlations, with mean |r| ranging from 0.47 (Alaska) to 0.89 (Northwest).

Opportunity Score vs. county-level mortality

The 4-dimension Opportunity Score correlates with mortality from CDC WONDER across every major cause of death (all p < 0.001):

| Mortality Outcome | Spearman r |
| --- | --- |
| All-cause mortality | +0.598 |
| Heart disease mortality | +0.546 |
| Chronic lower respiratory disease | +0.538 |
| Cancer mortality | +0.430 |

Environmental signal domain performance

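| Signal Domain | Pairs Tested | Avg Composite Score |
| --- | --- | --- |
| ILI Surveillance | 4 | 0.81 |
| Wastewater COVID | 2 | 0.81 |
| Humidity | 3 | 0.51 |
| Temperature | 7 | 0.31 |
| Heat Index | 6 | 0.28 |
| Air Quality | 6 | 0.16 |
| Severe Storms | 4 | 0.13 |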

Composite scores combine correlation strength, statistical significance, Granger causality, and state-level consistency into a single 0–1 metric. Scores above 0.5 indicate strong, consistent predictive relationships.

Interpretation

The strongest validation comes from ILI and wastewater surveillance, where signals predict hospital utilization with high consistency across regions. Environmental exposure signals (temperature, humidity) show moderate but significant associations with respiratory ED visits. Air quality and storm signals show weaker correlations, consistent with the ecological nature of the analysis and the diffuse exposure pathways involved.

These correlations are observed at the state level over time and do not establish individual-level causal pathways. They demonstrate that the signal domains tracked by the platform correspond to measurable patterns in health system utilization.

Limitations & Interpretation Guidance

What the scores can and cannot tell you

Ecological fallacy

All scores represent county-level aggregate patterns. Within-county variation may be substantial. A county with a high respiratory burden score may have asthma concentrated in specific neighborhoods near industrial sources, while other areas of the county are unaffected. Scores should not be interpreted as individual-level risk assessments.

Cross-sectional design

Compound signals identify geographic convergence of risk factors at a point in time. They do not establish temporal sequence or causation. A county with high environmental risk and high disease burden may reflect (a) environmental exposure contributing to disease, (b) population migration patterns that co-locate vulnerable groups with environmental hazards, or (c) shared upstream determinants affecting both.

Data lag

The platform integrates data sources with varying recency. CDC PLACES health estimates lag 1–2 years. EPA air quality data may lag 6–18 months. Provider counts from NPPES reflect registration, not active practice. Scores represent the best available composite picture, not a real-time snapshot.

NPPES limitations

Provider counts are derived from the National Plan and Provider Enumeration System, which records where providers register, not necessarily where they practice. To mitigate this, the Provider Gap score uses a 50/50 population-weighted blend of within-county and neighboring-county provider density. This adjacency adjustment reduces false positives where a county borders a major medical center, but does not fully resolve registration-vs-practice location discrepancies.

Weight subjectivity

Dimension and indicator weights reflect structured expert judgment informed by epidemiological literature and health system service line economics. They are not empirically derived from outcome data. Sensitivity analysis demonstrates that key findings are robust across reasonable weight perturbations.

Geographic resolution

The platform currently operates at county-level resolution. Sub-county patterns (ZIP code, census tract) may differ substantially from county-level scores. County-level analysis is most appropriate for regional planning and needs assessment; facility-level decisions require finer-grained analysis.

Versioning & Changelog

Methodology versions

The methodology is locked within a major version. Formula changes increment the minor version. Structural changes (new dimensions, new signals, validation-informed recalibration) increment the major version. Every scored output row is stamped with the methodology version that produced it.

v1.2.0 · April 2026
  • Opportunity Score expanded to 4 dimensions: added SDOH Stress (15%) alongside Environmental Risk (25%), Disease Burden (35%), Provider Gap (25%)
  • Provider Gap now uses 50/50 population-weighted adjacency adjustment (within-county + neighboring-county blend)
  • Compound signal detection expanded from 3 to 4 dimensions; tiers updated accordingly
  • Validation correlations improved by 0.08–0.10 across all mortality outcomes with the 4-dimension model
  • Cancer incidence data added (NCI State Cancer Profiles, 2,926 counties, 5 sites)
  • Full EJSCREEN environmental justice proximity indicators (9 new fields)
  • Demographics with race/ethnicity composition and pre-1978 housing lead risk proxy
  • Historical trends for 8 CDC PLACES measures (2018–2023)
v1.1.0 · April 2026
  • Replaced max() with dominant/secondary blend (60/40) for multi-condition disease components in named signals
  • Added specialty-specific provider gaps: pulmonology for respiratory and wildfire signals, cardiology for heat health risk
  • Added components_present and coverage_pct metadata to all dimension scores
  • Published sensitivity analysis across weight and threshold parameters
  • Published validation against NHSN, NSSP, and CMS utilization data
  • Added Limitations & Interpretation Guidance section
  • Updated methodology text to use associational rather than causal language
v1.0.0 · March 2026
  • Initial release with 4 named compound signals
  • max() aggregation for multi-condition disease components
  • Total provider density (not specialty-specific)

Citation

How to cite this methodology

Banana Analytics. (2026). Compound Signal Scoring Methodology, v1.2.0. Banana Analytics Technical Documentation. https://banana-analytics.com/methodology

If you reference the methodology or its outputs in published work, community health needs assessments, or grant applications, please cite the version you used. The methodology version is displayed on every report generated by the platform.
