What the public claim-status labels mean
- Headline claim: core confirmatory rows with the strongest survey-design handling and no fallback flags. Only these rows feed the main headline claim.
- Supporting evidence: inferential rows retained for added age or domain coverage, but kept separate from the main headline claim.
- Provisional: rows that depend on fallback weights or other temporary inference paths. They stay visible, but they do not count as headline evidence.
- Method-limited: rows with usable point estimates but weaker uncertainty estimates, usually because only simple effective-sample-size approximations were available.
- QA only: rows retained for transparency and diagnostics only, not for inferential claims.
Variance Ratio
The core effect-size estimator is the variance ratio: VR = s²_male / s²_female, the weighted male score variance divided by the weighted female score variance within a cell.
A value above 1.0× means males show greater score variability on that measure in that cell. A value below 1.0× means females show greater variability. A value near 1.0× means the sex-specific variances are approximately equal.
For example, a VR of 1.25× means male variance is 25% larger than female variance. A VR of 0.80× means male variance is 20% smaller.
Internally, the pipeline works on the log scale (ln VR) because variance ratios are multiplicative—the log transform makes male-greater and female-greater effects symmetric around zero, which simplifies standard-error estimation and cross-cell comparison. All public-facing results are reported on the natural VR scale.
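The log-scale symmetry can be illustrated with a minimal sketch (function names here are illustrative, not the pipeline's actual API):

```python
import math

def log_vr(var_male: float, var_female: float) -> float:
    """Return ln(VR) = ln(var_male / var_female)."""
    return math.log(var_male / var_female)

# On the natural VR scale, 1.25x (male-greater) and 0.80x (female-greater)
# look asymmetric; on the log scale they sit at equal distances from zero.
male_greater = log_vr(1.25, 1.0)    # VR = 1.25x  -> ln VR > 0
female_greater = log_vr(1.0, 1.25)  # VR = 0.80x  -> ln VR < 0
# male_greater == -female_greater
```

Reporting then maps back to the VR scale with `math.exp(ln_vr)`.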
Cell Construction
Each analysis cell is defined by the intersection of:
- Dataset — one of the 14 live data sources
- Cycle or wave — the assessment year, survey cycle, or longitudinal wave
- Country — currently U.S. for all datasets
- Age or grade band — e.g., kindergarten, grade 4, 30–34, 60–65
- Trait — the specific measured variable (e.g., math achievement, numeracy, grip strength)
Within each cell, the pipeline computes weighted means and variances separately for males and females, applies the appropriate inference method, and records the variance ratio along with its standard error and confidence interval where available.
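The per-cell computation can be sketched as follows. This is a simplified population-weighted variance; the actual pipeline's weighting corrections and inference steps (described below) are richer, and the function names are hypothetical:

```python
def weighted_var(scores, weights):
    """Population-style weighted variance (no small-sample correction)."""
    w_total = sum(weights)
    mean = sum(s * w for s, w in zip(scores, weights)) / w_total
    return sum(w * (s - mean) ** 2 for s, w in zip(scores, weights)) / w_total

def cell_vr(male_scores, male_weights, female_scores, female_weights):
    """Variance ratio for one analysis cell: male variance over female variance."""
    return (weighted_var(male_scores, male_weights)
            / weighted_var(female_scores, female_weights))
```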
Survey Weight Handling
Most of these datasets use complex survey designs with stratification, clustering, and unequal selection probabilities. Ignoring survey weights can bias variance estimates. The pipeline uses weighted estimation throughout, with dataset-specific handling:
Plausible Values
Several large-scale assessments do not report a single score per student. Instead, they provide multiple plausible values—draws from the posterior distribution of each student’s proficiency. The pipeline averages variance estimates across all plausible-value draws within each cell, properly combining measurement uncertainty with sampling uncertainty.
Datasets using plausible values: PIAAC, PISA, TIMSS, PIRLS, ICILS.
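A standard way to combine estimates across plausible-value draws is Rubin's rules, sketched below on the ln VR scale. This is an assumed combining rule consistent with the description above, not a transcript of the pipeline's code:

```python
from statistics import mean, variance

def combine_plausible_values(lnvr_draws, sampling_vars):
    """Rubin-style combination across M plausible-value draws.

    lnvr_draws    : ln VR estimated from each PV draw
    sampling_vars : sampling variance of ln VR from each draw
    Returns (point estimate, total variance), where total variance adds
    the average sampling variance (within) to the between-draw variance
    that captures measurement uncertainty.
    """
    M = len(lnvr_draws)
    point = mean(lnvr_draws)
    within = mean(sampling_vars)        # sampling uncertainty
    between = variance(lnvr_draws)      # measurement uncertainty across draws
    total = within + (1 + 1 / M) * between
    return point, total
```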
Replicate Weights
Rather than supporting analytic variance formulas, many surveys provide sets of replicate weights that encode the survey design. The pipeline uses the appropriate replication method for each dataset to estimate standard errors for the variance ratio.
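The replication idea shared by these methods is simple: re-estimate the statistic under each replicate weight set and scale the squared deviations from the full-sample estimate. A generic sketch (the scaling factor varies by method and is an assumption here, not the pipeline's exact constant):

```python
def replication_variance(theta_full, theta_reps, factor):
    """Generic replicate-weight variance for an estimate theta.

    theta_full : estimate using the full-sample weights
    theta_reps : estimates re-computed under each replicate weight set
    factor     : method-specific scaling, e.g. roughly
                 JRR:     1.0 (or per-zone factors)
                 BRR-Fay: 1.0 / (R * (1 - fay) ** 2) for R replicates
    """
    return factor * sum((t - theta_full) ** 2 for t in theta_reps)
```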
Inference Modes
The pipeline supports five inference approaches, matched to what each dataset provides:
| Method | Description | Datasets |
|---|---|---|
| PV + replicate weights | Plausible-value averaging combined with replicate-weight standard errors. The strongest inference mode, handling both measurement and sampling uncertainty. | PIAAC, PISA, TIMSS, PIRLS, ICILS |
| BRR | Balanced Repeated Replication. Uses Fay-adjusted replicate weights to estimate SEs without plausible values. | HSLS:09 |
| JRR | Jackknife Repeated Replication. Uses jackknife zones and replicate indicators to estimate SEs. | TIMSS, PIRLS, ICILS (combined with PV) |
| Stratified PSU bootstrap | Design-aware bootstrap resampling using primary sampling unit and stratum metadata. | ECLS-K:2011, NHANES, NNYFS |
| Approximate household-cluster bootstrap | Bootstrap using tracker strata and household clusters as a proxy for the full survey design. Explicitly treated as method-limited. | HRS |
| Simple effective-sample-size approximation | Analytic SE formula using design-adjusted effective sample sizes. The weakest inference mode. Cells using this method are marked method-limited and excluded from headline claims. | NLSY79, NLSY97, NLSY79 Child/YA, PSID CDS/TAS |
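The weakest mode in the table can be sketched concretely. The Kish effective sample size and the normal-theory approximation Var(ln s²) ≈ 2/(n − 1) are assumed here as the standard forms of this approach; the pipeline's exact formula may differ:

```python
import math

def effective_n(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def lnvr_se_ess(n_eff_male, n_eff_female):
    """Approximate SE of ln VR from design-adjusted effective sample sizes.

    Uses the normal-theory result Var(ln s^2) ~= 2 / (n - 1) for each sex,
    then adds the two variances since the samples are independent.
    """
    return math.sqrt(2 / (n_eff_male - 1) + 2 / (n_eff_female - 1))
```

With equal weights, `effective_n` reduces to the raw sample size; the more unequal the weights, the smaller the effective n and the wider the resulting interval.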
Evidence Status System
Every output cell carries an evidence-status label. These labels determine which cells are eligible for headline claims and which are retained only for transparency or supporting analysis.
- Headline claim: Inferential cell with confirmatory-priority trait, proper survey-design inference, and no provisional flags. Only these cells are used in the headline claim.
- Supporting evidence: Inferential cell from a secondary or exploratory-priority trait. Broadens domain coverage but is not part of the headline evidence.
- Provisional: Cell that relies on fallback weights, alternate public weight sources, or other provisional inference paths. Retained for sensitivity analysis but excluded from headline claims.
- Method-limited: Cell using simple-design SE approximation instead of replicate weights. Point estimates may be reasonable, but the uncertainty quantification is weaker.
- QA only: Cell that falls below sample-size thresholds, has bounded-scale issues, or is otherwise unsuitable for inferential use. Retained for quality-assurance diagnostics only.
This system ensures that public-facing summaries do not collapse strong and weak evidence into a single pooled claim. The public headline summary uses only rows labeled Headline claim.
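The gating rule for public summaries is a straightforward filter. A minimal sketch, assuming each output row carries a `status` field holding one of the five labels (the field name is hypothetical):

```python
HEADLINE = "Headline claim"

def headline_rows(rows):
    """Keep only rows eligible for the public headline summary."""
    return [r for r in rows if r["status"] == HEADLINE]
```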