What the public claim-status labels mean
- Headline claim: core confirmatory rows with the strongest survey-design handling and no fallback flags. Only these rows feed the main headline claim.
- Supporting evidence: inferential rows retained for added age or domain coverage, but kept separate from the main headline claim.
- Provisional: rows that depend on fallback weights or other temporary inference paths. They stay visible, but they do not count as headline evidence.
- Method-limited: rows with usable point estimates but weaker uncertainty estimates, usually because only simple effective-sample-size approximations were available.
- QA only: rows retained for transparency and diagnostics only, not for inferential claims.
Variance Ratio
The core effect-size estimator is the variance ratio: VR = s²_male / s²_female, the weighted male score variance divided by the weighted female score variance within a cell.
A value above 1.0× means males show greater score variability on that measure in that cell. A value below 1.0× means females show greater variability. A value near 1.0× means the sex-specific variances are approximately equal.
For example, a VR of 1.25× means male variance is 25% larger than female variance. A VR of 0.80× means male variance is 20% smaller.
Internally, the pipeline works on the log scale (ln VR) because variance ratios are multiplicative—the log transform makes male-greater and female-greater effects symmetric around zero, which simplifies standard-error estimation and cross-cell comparison. All public-facing results are reported on the natural VR scale.
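The log-scale symmetry can be illustrated with a minimal sketch (function names here are illustrative, not the pipeline's actual API):

```python
import math

def log_vr(var_male: float, var_female: float) -> float:
    """Return ln(VR) = ln(var_male / var_female)."""
    return math.log(var_male / var_female)

# On the natural VR scale, 1.25x (male-greater) and 0.80x (female-greater)
# look asymmetric; on the log scale they sit at equal distances from zero.
male_greater = log_vr(1.25, 1.0)    # VR = 1.25x  -> ln VR > 0
female_greater = log_vr(1.0, 1.25)  # VR = 0.80x  -> ln VR < 0
# male_greater == -female_greater
```

Reporting then maps back to the VR scale with `math.exp(ln_vr)`.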
Cell Construction
Each analysis cell is defined by the intersection of:
- Dataset — one of the 14 live data sources
- Cycle or wave — the assessment year, survey cycle, or longitudinal wave
- Country — currently U.S. for all datasets
- Age or grade band — e.g., kindergarten, grade 4, 30–34, 60–65
- Trait — the specific measured variable (e.g., math achievement, numeracy, grip strength)
Within each cell, the pipeline computes weighted means and variances separately for males and females, applies the appropriate inference method, and records the variance ratio along with its standard error and confidence interval where available.
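The per-cell computation can be sketched as follows. This is a simplified population-weighted variance; the actual pipeline's weighting corrections and inference steps (described below) are richer, and the function names are hypothetical:

```python
def weighted_var(scores, weights):
    """Population-style weighted variance (no small-sample correction)."""
    w_total = sum(weights)
    mean = sum(s * w for s, w in zip(scores, weights)) / w_total
    return sum(w * (s - mean) ** 2 for s, w in zip(scores, weights)) / w_total

def cell_vr(male_scores, male_weights, female_scores, female_weights):
    """Variance ratio for one analysis cell: male variance over female variance."""
    return (weighted_var(male_scores, male_weights)
            / weighted_var(female_scores, female_weights))
```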
Survey Weight Handling
Most of these datasets use complex survey designs with stratification, clustering, and unequal selection probabilities. Ignoring survey weights can bias variance estimates. The pipeline uses weighted estimation throughout, with dataset-specific handling:
Plausible Values
Several large-scale assessments do not report a single score per student. Instead, they provide multiple plausible values—draws from the posterior distribution of each student’s proficiency. The pipeline averages variance estimates across all plausible-value draws within each cell, properly combining measurement uncertainty with sampling uncertainty.
Datasets using plausible values: PIAAC, PISA, TIMSS, PIRLS, ICILS.
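A standard way to combine estimates across plausible-value draws is Rubin's rules, sketched below on the ln VR scale. This is an assumed combining rule consistent with the description above, not a transcript of the pipeline's code:

```python
from statistics import mean, variance

def combine_plausible_values(lnvr_draws, sampling_vars):
    """Rubin-style combination across M plausible-value draws.

    lnvr_draws    : ln VR estimated from each PV draw
    sampling_vars : sampling variance of ln VR from each draw
    Returns (point estimate, total variance), where total variance adds
    the average sampling variance (within) to the between-draw variance
    that captures measurement uncertainty.
    """
    M = len(lnvr_draws)
    point = mean(lnvr_draws)
    within = mean(sampling_vars)        # sampling uncertainty
    between = variance(lnvr_draws)      # measurement uncertainty across draws
    total = within + (1 + 1 / M) * between
    return point, total
```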
Replicate Weights
Rather than supporting analytic variance formulas, many surveys provide sets of replicate weights that encode the survey design. The pipeline uses the appropriate replication method for each dataset to estimate standard errors for the variance ratio.
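The replication idea shared by these methods is simple: re-estimate the statistic under each replicate weight set and scale the squared deviations from the full-sample estimate. A generic sketch (the scaling factor varies by method and is an assumption here, not the pipeline's exact constant):

```python
def replication_variance(theta_full, theta_reps, factor):
    """Generic replicate-weight variance for an estimate theta.

    theta_full : estimate using the full-sample weights
    theta_reps : estimates re-computed under each replicate weight set
    factor     : method-specific scaling, e.g. roughly
                 JRR:     1.0 (or per-zone factors)
                 BRR-Fay: 1.0 / (R * (1 - fay) ** 2) for R replicates
    """
    return factor * sum((t - theta_full) ** 2 for t in theta_reps)
```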
Inference Modes
The pipeline supports five inference approaches, matched to what each dataset provides:
| Method | Description | Datasets |
|---|---|---|
| PV + replicate weights | Plausible-value averaging combined with replicate-weight standard errors. The strongest inference mode, handling both measurement and sampling uncertainty. | PIAAC, PISA, TIMSS, PIRLS, ICILS |
| BRR | Balanced Repeated Replication. Uses Fay-adjusted replicate weights to estimate SEs without plausible values. | HSLS:09 |
| JRR | Jackknife Repeated Replication. Uses jackknife zones and replicate indicators to estimate SEs. | TIMSS, PIRLS, ICILS (combined with PV) |
| Stratified PSU bootstrap | Design-aware bootstrap resampling using primary sampling unit and stratum metadata. | ECLS-K:2011, NHANES, NNYFS |
| Approximate household-cluster bootstrap | Bootstrap using tracker strata and household clusters as a proxy for the full survey design. Explicitly treated as method-limited. | HRS |
| Simple effective-sample-size approximation | Analytic SE formula using design-adjusted effective sample sizes. The weakest inference mode. Cells using this method are marked method-limited and excluded from headline claims. | NLSY79, NLSY97, NLSY79 Child/YA, PSID CDS/TAS |
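The weakest mode in the table can be sketched concretely. The Kish effective sample size and the normal-theory approximation Var(ln s²) ≈ 2/(n − 1) are assumed here as the standard forms of this approach; the pipeline's exact formula may differ:

```python
import math

def effective_n(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def lnvr_se_ess(n_eff_male, n_eff_female):
    """Approximate SE of ln VR from design-adjusted effective sample sizes.

    Uses the normal-theory result Var(ln s^2) ~= 2 / (n - 1) for each sex,
    then adds the two variances since the samples are independent.
    """
    return math.sqrt(2 / (n_eff_male - 1) + 2 / (n_eff_female - 1))
```

With equal weights, `effective_n` reduces to the raw sample size; the more unequal the weights, the smaller the effective n and the wider the resulting interval.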
Evidence Status System
Every output cell carries an evidence-status label. These labels determine which cells are eligible for headline claims and which are retained only for transparency or supporting analysis.
- Headline claim: Inferential cell with confirmatory-priority trait, proper survey-design inference, and no provisional flags. Only these cells are used in the headline claim.
- Supporting evidence: Inferential cell from a secondary or exploratory-priority trait. Broadens domain coverage but is not part of the headline evidence.
- Provisional: Cell that relies on fallback weights, alternate public weight sources, or other provisional inference paths. Retained for sensitivity analysis but excluded from headline claims.
- Method-limited: Cell using simple-design SE approximation instead of replicate weights. Point estimates may be reasonable, but the uncertainty quantification is weaker.
- QA only: Cell that falls below sample-size thresholds, has bounded-scale issues, or is otherwise unsuitable for inferential use. Retained for quality-assurance diagnostics only.
This system ensures that public-facing summaries do not collapse strong and weak evidence into a single pooled claim. The public headline summary uses only rows labeled Headline claim.
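The gating rule for public summaries is a straightforward filter. A minimal sketch, assuming each output row carries a `status` field holding one of the five labels (the field name is hypothetical):

```python
HEADLINE = "Headline claim"

def headline_rows(rows):
    """Keep only rows eligible for the public headline summary."""
    return [r for r in rows if r["status"] == HEADLINE]
```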