The missing heritability question is now (mostly) answered
Not with a bang but with a whimper
The “missing heritability” conundrum goes like this: (1) twin studies, which contrast phenotypic correlations between monozygotic and dizygotic twins, tend to estimate the heritability of common traits at 50-60% on average; (2) genome-wide association studies (GWAS), which sum up the trait association of individual common mutations, tend to estimate the heritability of common traits at 20-30% — so what explains the other 20-30%? Both methods have their limitations: twin studies assume there are no environmental confounders and no interactions; GWAS only measures a subset of common variants. So are twin studies getting the environment wrong or is GWAS missing a huge tranche of trait-relevant rare variation? This debate has gone on for nearly two decades since GWAS entered the picture, but the debate over environmental interactions goes all the way back to the dawn of quantitative genetics itself. Now, with two recent molecular studies, we have an answer.
Beyond twin studies, there are several ways of estimating total “direct” heritability: i.e. the fraction of phenotypic variance that is due to genetic variants which act directly within an individual on their phenotype [1].
The best approach is to use the random genetic variation within a family by effectively conditioning on the parental genotypes — proposed in a method known as Relatedness Disequilibrium Regression (or RDR). By using within-family variation, RDR is not susceptible to environmental confounding from relatives or through population stratification. By using primarily family “trios” with one child, RDR is inherently immune to biases from sibling environmental relationships. And by drawing most of its signal from the pairs of individuals between different families (after accounting for genetic confounding via their parents), RDR does not pick up environmental or genetic interactions which distant individuals do not share. RDR thus estimates “narrow-sense” heritability in its most well-defined form.
The next best approach is to use the random genetic sharing between siblings — a method known as Sibling Regression (or SR). By using within-family variation, like RDR, SR is not confounded by stratification and shared environment. But because it uses only pairs of siblings, SR has to assume that siblings do not systematically influence each other (no “sibling indirect effects”). And because siblings are not distant relatives, SR will also pick up the influence of within-family gene-gene and gene-environment interactions. So SR estimates something that is between narrow-sense and broad-sense heritability (if you consider broad-sense heritability to include GxE).
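To make the logic of sibling regression concrete, here is a minimal simulation sketch. It is a toy model, not the pipeline used in any of the studies discussed: it assumes a purely additive trait, a hypothetical shared family environment, and realized sibling sharing that varies around 0.5 with an assumed spread of 0.04. Under those assumptions the expected cross-product of two siblings' standardized phenotypes is `h2 * pi + c2`, so the slope of the cross-product on sharing recovers the heritability and the shared environment is absorbed by the intercept. All function names and parameter values are illustrative.

```python
import numpy as np

def simulate_sib_pairs(n_pairs, h2, c2, rng):
    """Simulate standardized phenotypes for sibling pairs (toy additive model)."""
    # Realized genetic sharing varies around 0.5; the 0.04 spread is an assumption.
    pi = np.clip(rng.normal(0.5, 0.04, n_pairs), 0.0, 1.0)
    # Additive genetic values with variance h2 and within-pair correlation pi.
    z1 = rng.standard_normal(n_pairs)
    z2 = rng.standard_normal(n_pairs)
    a1 = np.sqrt(h2) * z1
    a2 = np.sqrt(h2) * (pi * z1 + np.sqrt(1.0 - pi**2) * z2)
    # Shared family environment (variance c2) plus unique noise for each sibling.
    c = np.sqrt(c2) * rng.standard_normal(n_pairs)
    e_sd = np.sqrt(1.0 - h2 - c2)
    y1 = a1 + c + e_sd * rng.standard_normal(n_pairs)
    y2 = a2 + c + e_sd * rng.standard_normal(n_pairs)
    return pi, y1, y2

def sib_regression(pi, y1, y2):
    """Regress the phenotype cross-product on sharing; returns (slope, intercept)."""
    slope, intercept = np.polyfit(pi, y1 * y2, 1)
    return slope, intercept

rng = np.random.default_rng(0)
pi, y1, y2 = simulate_sib_pairs(1_000_000, h2=0.30, c2=0.20, rng=rng)
h2_hat, c2_hat = sib_regression(pi, y1, y2)
print(f"h2 estimate: {h2_hat:.2f}, shared-environment intercept: {c2_hat:.2f}")
```

Because sharing between siblings varies so little, even a million simulated pairs gives a fairly noisy slope, which is one reason real SR estimates for individual traits carry wide standard errors.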
The third best approach is to brute force it: simply measure every single mutation in the genome in a large population of unrelated individuals, typically using a whole-genome sequencing (WGS) assay, and then put all of those mutations into one of several well-established GWAS heritability estimators (we’ll call this GREML-WGS [2]). This approach does not have the advantage of using within-family variation, and therefore will include any environmental influences that are correlated with genetics, such as familial factors or stratification. That means GREML-WGS estimates are essentially untethered from narrow- or broad-sense heritability because they can include entirely non-genetic variance. But many biobank traits, like lipid levels or blood counts, are probably not under the strong indirect influence of parental genetics. So GREML-WGS can act as a crude confirmatory analysis, as well as a way to partition the contribution of measurable rare and common variation.
Finally, the least interpretable way to estimate heritability is to contrast phenotypic correlations across relatives. This “kinship” (or “pedigree”) based approach looks at how siblings are more phenotypically correlated than cousins, cousins are more correlated than second cousins, and so on. Siblings also share more environments than cousins, so non-genetic influence on the phenotype that tracks with genealogy will also look like “heritability”. Kinship-based models therefore provide us with a mushy estimate of narrow-sense heritability plus an unbounded amount of environmental confounding.
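A small worked example shows how environmental similarity that tracks genealogy gets absorbed into a kinship-based estimate. The numbers are hypothetical: a true direct heritability of 0.30 and made-up environmental correlations that decay from siblings to second cousins. The estimator is a deliberately simple least-squares slope through the origin, standing in for the more elaborate models used in practice.

```python
import numpy as np

# Expected genetic relatedness for siblings, first cousins, second cousins.
relatedness = np.array([0.5, 0.125, 0.03125])

true_h2 = 0.30  # assumed direct heritability for this toy example
# Hypothetical environmental similarity that also decays with genealogy.
env_corr = np.array([0.10, 0.02, 0.0])

# Phenotypic correlation = genetic part + environmental part.
pheno_corr = relatedness * true_h2 + env_corr

def kinship_h2(relatedness, pheno_corr):
    """Least-squares slope (through the origin) of correlation on relatedness."""
    return float(relatedness @ pheno_corr / (relatedness @ relatedness))

clean = kinship_h2(relatedness, relatedness * true_h2)
naive = kinship_h2(relatedness, pheno_corr)
print(f"no environmental tracking:   {clean:.2f}")
print(f"with environmental tracking: {naive:.2f}")
```

The model has no way to tell the two sources of similarity apart, so every bit of environment that follows the family tree is counted as “heritability”.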
Long story short, several recent papers have now applied each of these methods to a representative set of complex traits. Young et al. (2018) ran RDR on 14 traits in Iceland, Yengo et al. (2025) ran SR on 14 traits in collaboration with 23andMe, and Wainschtein et al. (2025) ran GREML-WGS on 34 traits in the UK Biobank. For any one trait, the estimates are often still much too uncertain and population-specific (see below). But across traits, one can get a good sense of where the missing heritability is and where it isn’t:

Amazingly, with three different methods and datasets, the estimated “narrow-sense” heritability comes in at ~30% (shown above in blue). What about rare variants? In the two studies where common variant / GWAS heritability was estimated (Young et al. and Wainschtein et al.), it accounted for ~85% of the total narrow-sense heritability estimate, indicating that rare variants make a relatively small contribution on average and that GWAS provides a lower bound only slightly below the total narrow-sense heritability. As tentative support, Wainschtein et al. conducted an exploratory analysis of the heritability from 760 million “ultra-rare” variants and found that they added essentially nothing (~0.12%) to the total heritability estimate on average.
The fact that both RDR and SR obtained nearly identical estimates of 30% is also striking. The primary difference between the methods is in how they treat gene-gene/gene-environment interactions, so lack of any difference in the estimates suggests that interactions have a limited role on average (with the caveat that the traits analyzed in these two studies were fairly arbitrary and mostly not overlapping). As tentative support, Wainschtein et al. used kinship data to estimate the contribution of non-additive genetic effects and found that an additive model generally fit very well across the traits.
Speaking of which, the corresponding kinship-based heritability estimates of 41-42% were also strikingly similar. The gap between ~30% (RDR/SR) and ~41% (kinship) is indicative of a sizable amount of environmental influence that correlates with genetic relatedness. As I’ve noted before, the mere fact that relatives exhibit similar traits is not sufficient to conclude that genes are strongly involved: relatives share environments too.
So where does that leave us in terms of “missing heritability”? Ideally we could compare the RDR and SR estimate to estimates from twins in the same exact cohorts and phenotypes. As a rough approximation, I dug up the corresponding Classic Twin Design estimates from the literature for the traits in each study and averaged them: unsurprisingly, they are in the 50-60% range, as is very typically observed in twin studies. So twin studies produce a ~2x inflated estimate of narrow-sense heritability when compared to molecular estimates that are free of environmental confounding. The mystery of twin heritability comes to an ignoble end: no massive tranche of rare variants, no phantom interactions, just inflation.
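To see how this kind of inflation can arise mechanically, here is a minimal sketch of the classic twin contrast (Falconer's formula, `h2 = 2 * (r_MZ - r_DZ)`). It assumes a purely additive trait with a true heritability of 0.30. The environmental correlations are hypothetical and were picked only so that the inflated estimate lands at the 55% midpoint of the range quoted above. They are not estimates from any study.

```python
def twin_correlations(a, c_mz, c_dz):
    """Expected twin correlations for additive variance a and shared environment."""
    r_mz = a + c_mz        # MZ twins share all additive variance
    r_dz = 0.5 * a + c_dz  # DZ twins share half of it on average
    return r_mz, r_dz

def falconer_h2(r_mz, r_dz):
    """Classic twin estimate of heritability."""
    return 2.0 * (r_mz - r_dz)

true_a = 0.30  # assumed true narrow-sense heritability

# Equal environments: MZ and DZ twins share environment to the same degree.
h2_equal = falconer_h2(*twin_correlations(true_a, c_mz=0.10, c_dz=0.10))

# Unequal environments: MZ twins share a hypothetical extra 0.125 of variance.
h2_unequal = falconer_h2(*twin_correlations(true_a, c_mz=0.225, c_dz=0.10))

print(f"equal environments:   h2 = {h2_equal:.2f}")
print(f"unequal environments: h2 = {h2_unequal:.2f}")
```

Any extra environmental similarity in MZ twins is doubled by the formula, so a fairly modest excess is enough to nearly double the estimate.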
Is that really it?
Scientists don’t like to declare that a hypothesis has been falsified. It goes against the cautious academic “house style” and it puts some real skin in the game with regards to future results. I’m sure some of my colleagues will argue that we should continue to remain agnostic on which quantitative method gives us the most accurate estimate of causal population parameters. In my opinion, these arguments are no longer tenable.
Could there be something wrong with these fancy new methods or these datasets? The three different methods have very different assumptions and the three independent datasets were large, recruited in different ways, and used in many prior publications. Yet all three converged on very similar estimates.
Perhaps a massive tranche of ultra-rare variants with massive effects will still explain the difference? There is a tendency to keep looking over the next hilltop for the answer, but this hypothesis has been thoroughly mined. The SR estimates already include the contribution of ultra-rare variants and the RDR estimates include the contribution of many (though not all) ultra-rare variants. Wainschtein et al. also found no contribution from ultra-rare variants on average and the previous work of Weiner et al. (2023) estimated the gene burden heritability including ultra-rare variants at just 1.3% on average. Lastly, highly penetrant ultra-rare variants would have been discovered with pre-GWAS linkage analyses powered precisely for this scenario.
Perhaps genetic interactions explain the difference? SR already includes the contribution of genetic interactions in its estimate. Interactions are expected to be largely captured by additive heritability, and a substantial contribution of non-additive interactions also inflates classic ACE/twin study estimates and deflates the influence of the shared environment.
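The second point can be checked with textbook expectations for twin correlations. In the sketch below, dominance variance stands in for non-additive effects in general: MZ twins share all of it and DZ twins a quarter. The variance components are hypothetical values chosen for illustration.

```python
def twin_correlations(a, d, c):
    """Expected twin correlations given additive (a), dominance (d), shared env (c)."""
    r_mz = a + d + c
    r_dz = 0.5 * a + 0.25 * d + c
    return r_mz, r_dz

def ace_estimates(r_mz, r_dz):
    """Falconer-style estimates from a model that ignores non-additive variance."""
    h2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    return h2, c2

# Hypothetical variance components (fractions of total phenotypic variance).
a, d, c = 0.30, 0.10, 0.10

h2_hat, c2_hat = ace_estimates(*twin_correlations(a, d, c))
print(f"true additive: {a:.2f}, estimated h2: {h2_hat:.2f}")
print(f"true shared env: {c:.2f}, estimated c2: {c2_hat:.2f}")
```

Algebraically the heritability estimate comes out as `a + 1.5 * d` and the shared-environment estimate as `c - 0.5 * d`, so non-additive variance pushes the first up and the second down at the same time.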
Is it wrong to meta-analyze different traits together? This is exactly what was done in a prior meta-analysis, which found that “across all traits the reported heritability is 49%” for twin studies. And what about assortative mating? Assortative mating impacts the within-family methods (RDR, SR, and twin studies) to the same extent, so it does not change the molecular/twin gap. In any case, removing the few traits under high assortment from the Wainschtein et al. analysis has a negligible impact on the results.
It may be uncomfortable to conclude that a widely used study design has been producing spurious results. But the evidence is in, and telling uncomfortable truths is a part of doing science.
What comes next?
When a big question is answered, many new questions bloom.
Understanding individual traits
So far I’ve discussed all of the results in terms of cross-trait averages, which is admittedly a bit clunky. We can use the “average” trait to understand broad patterns about these study designs, but every trait will be slightly different. To get a sense of what trait-specific information could tell us, we can focus on one trait that has been analyzed in many studies: BMI (h/t to Vinay Tummarakota for first using this trait as a convenient test case). BMI is a very interesting phenotype because it is measured ubiquitously and with little error; is a mix of the physiological (metabolism, fat storage, etc) and the behavioral (satiety, self-control, etc); fundamentally interacts with the environment (your weight depends on your lifestyle and food access); but does not exhibit substantial between versus within family differences. Here are the relevant heritability estimates for BMI across a variety of methods:

The broad patterns we saw previously remain: rare variants (blue) contribute little [3]. Total narrow-sense estimates (green) from RDR and GREML-WGS are very similar at 29-34%. Kinship estimates (orange) are substantially higher at 47-55%. But (interestingly!) sibling regression estimates (yellow) are also substantially higher at 39-55%, indicative of interactions (as previously hypothesized). Finally, twin estimates are again in the stratosphere at 65-75%, with estimates as high as 96% when analyzed in twins “reared apart”. So, in contrast to what we saw on average across traits, BMI seems like an example where investigating interactions could prove quite fruitful. Surely similar patterns could be found for other traits if the estimates were sufficiently accurate. Ideally, future studies will be large enough to apply all quantitative genetic methods within a unified cohort and contrast their results.
IQ is still not like height
BMI is interesting, but the traits that tend to produce the largest discordance between GWAS and twin estimates are those related to cognition and education. Perhaps not coincidentally, these traits also tend to exhibit the most population stratification, environmental variability, and (at least in the case of educational attainment) familial effects — all of which can induce bias if not properly modeled.
Wainschtein et al. provide an apt demonstration of these challenges in their GREML-WGS analysis of educational attainment and a fluid IQ score. Prior to adjusting for any environmental covariates, both traits exhibited very high heritability estimates (48-61%). After adjusting for genetic ancestry components, the heritability estimates dropped substantially (40-43%). After adjusting for geographic clustering the heritability estimates dropped further (38-40%). After adjusting for more geographic clustering, the heritability estimates dropped even further (32-34%). As a comparison, we can look at the same covariate adjustment for standing height and see that there is essentially no impact whatsoever. In fact, education and IQ exhibited by far the strongest evidence of stratification compared to the other traits analyzed (which were largely anthropometric or blood/lipid-oriented).

The authors stopped at 100 geographic clusters, but does this adjustment fully correct for environmental biases (or maybe over-correct)? No one really knows! We do know that a substantial fraction of the apparent educational attainment “heritability” is actually indirect associations from stratification or parental effects, and that disentangling these associations requires genotyped family data. So in a fundamental sense, GREML-WGS is the wrong tool for this question (direct narrow-sense heritability) for these traits (stratified behavior) [4].
Ultimately, large-scale RDR and SR analyses will be needed to resolve the question of why behavioral traits differ so substantially between GWAS and twins. While Yengo et al. did not investigate cognitive-behavioral phenotypes (apparently due to some arbitrary data restrictions), a few recent studies might give us a preview: Markel et al. conducted the largest SR meta-analysis of educational attainment to date, across ~80,000 sibling pairs, and estimated a heritability of just 7.6% (s.e. 9.5%); Wang et al. applied SR to educational attainment in the Mexico City Prospective Study and estimated the heritability to be -10% (with a standard error of 11%). Yes, negative.
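For a sense of how much uncertainty those standard errors imply, here is a quick sketch that turns the two reported estimates into approximate 95% intervals. It assumes a normal sampling distribution (estimate ± 1.96 standard errors), which is a simplification and may not match how the original studies report their intervals.

```python
# Point estimates and standard errors as quoted in the text (as proportions).
estimates = {
    "Markel et al. (SR meta-analysis)": (0.076, 0.095),
    "Wang et al. (Mexico City Prospective Study)": (-0.10, 0.11),
}

def normal_ci(estimate, se, z=1.96):
    """Approximate 95% interval under a normal approximation (an assumption)."""
    return estimate - z * se, estimate + z * se

intervals = {name: normal_ci(est, se) for name, (est, se) in estimates.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name}: {lo:+.1%} to {hi:+.1%}")
```

Both intervals span zero, so the negative point estimate is best read as sampling noise around a small value rather than anything biologically strange.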
Twin research adapting in real time
When I started this blog I outlined several potential explanations for the missing heritability problem, including this one:
“The twins are wrong because the equal environment assumption is routinely violated and MZ twins are fundamentally different from DZs. This is perhaps the least interesting outcome from the perspective of science, since it is simply a methodological flaw. But it is fascinating from the perspective of the history of science in that it would undermine a swathe of major findings in twin-based behavioral genetics for over a century, a reality the field will need to adapt to in real time.”
And now here we are. The big conceptual questions going forward are: Can twin studies recover the true un-inflated estimates through more careful control of environments? How has twin study inflation influenced other parameters estimated from such studies (e.g. when twins are used as genetic controls or to correct estimates of intergenerational transmission for genetic confounding)? And, finally, will twin researchers care? Perhaps we should call it the “missing environment” problem.
Update: Eric Turkheimer has some more useful discussion on what heritability does and does not mean, and the “three legs of the missing heritability problem”.
[1] See Veller, Przeworski, Coop (2024) for more discussion of direct effects and Barry et al. (2022) for more discussion of heritability estimation methods.
[2] “G” is for genetic, “REML” is the algorithm used to do the model fitting (a simple likelihood maximization with some tricks to deal with covariates), and “WGS” is for the sequencing data that goes into it.
[3] Curiously, Wainschtein et al. estimated significantly higher rare variant heritability for BMI than a recent analysis of the same exact data by Hawkes et al. using a slightly different method. Most likely this is a function of the way the data was QC’ed and processed and we would do well not to over-interpret any individual trait estimate until several groups have independently conducted these studies.
[4] In putting together the figures on cross-trait average estimates, I investigated dropping any traits that exhibited large differences before/after correction for ancestry/geography. This did not substantially change the results (slightly decreasing the rare variant heritability from 5% to 4% and slightly increasing the common variant heritability from 22% to 23%).





To me this is like dark matter or placebo effects. We are being asked to believe in a great many hypothetical things that we cannot observe and to disbelieve things that we can observe as easily and as frequently as we wish. We are asked to believe in vague "unknown confounders" in twin studies. Can anyone propose a single sensible confounder that could explain this discrepancy? And what even counts as an "environment"? Do we actually observe siblings having the same environmental experiences? The events of my life have only a fractional overlap with those of my siblings. I was not even present for most of the things that would have befallen them. For some reason we all go around talking about "shared environment" like we know what it means. Unless you are a conjoined twin, I cannot imagine how you could have the same "environment" in any scientific sense.
And this is the heart of it to me. If siblings in the same household frequently do not share meaningful experiences, then whatever tiny sliver of "shared environment" does exist must be even smaller when we compare MZ and DZ twins. The difference between being dressed the same and being referred to as "the twins" is trivial in scale compared with the sheer magnitude of the MZ–DZ correlation gap. These minor differences cannot possibly account for systematic and reproducible heritability estimates. The non-shared experiences overwhelm everything, and what environmental similarity remains is negligible. If anything, the divergence of lived experience strengthens the logic of twin studies, because it means the confounding potential of environment becomes vanishingly small.
To my mind, twin studies are therefore definitive, akin to testing a gun by shooting yourself. They control for any confounder that matters, and what we are more likely witnessing is some peculiar problem in how modern statistical methods carve up genetic variance. When one method captures the whole landscape and another measures only a thin slice of it, the discrepancy should not surprise us.
It is a case of the math telling us the moon is made of cheese and that we should jump off that cliff. My suggestion is to keep a firm grip on reality. Yes, it is hard to explain how three methods could converge, but it is like asking "can people levitate?". It is a question of what is clearly visible to the naked eye, and not in the sense of a streak in the dark but in the sense of something you can observe repeatedly, like the sun.
I'm not qualified to have an opinion or ask a meaningful question here. Just wanted to say thank you.