82 Comments
Peter's avatar

To me this is like dark matter or placebo effects. We are being asked to believe in a great many hypothetical things that we cannot observe and to disbelieve things that we can observe as easily and as frequently as we wish. We are asked to believe in vague "unknown confounders" in twin studies. Can anyone propose a single sensible confounder that could explain this discrepancy? And what even counts as an "environment"? Do we actually observe siblings having the same environmental experiences? The events of my life have only a fractional overlap with those of my siblings. I was not even present for most of the things that would have befallen them. For some reason we all go around talking about "shared environment" like we know what it means. Unless you are a conjoined twin, I cannot imagine how you could have the same "environment" in any scientific sense.

And this is the heart of it to me. If siblings in the same household frequently do not share meaningful experiences, then whatever tiny sliver of "shared environment" does exist must be even smaller when we compare MZ and DZ twins. The difference between being dressed the same and being referred to as "the twins" is trivial in scale compared with the sheer magnitude of the MZ–DZ correlation gap. These minor differences cannot possibly account for systematic and reproducible heritability estimates. The non-shared experiences overwhelm everything, and what environmental similarity remains is negligible. If anything, the divergence of lived experience strengthens the logic of twin studies, because it means the confounding potential of environment becomes vanishingly small.

To my mind, twin studies are therefore definitive, akin to testing a gun by shooting yourself. They control for any confounder that matters, and what we are more likely witnessing is some peculiar problem in how modern statistical methods carve up genetic variance. When one method captures the whole landscape and another measures only a thin slice of it, the discrepancy should not surprise us.

It is a case of the math telling us the moon is made of cheese and that we should jump off that cliff. My suggestion is to keep a firm grip on reality. Yes, it is hard to explain how three methods could converge, but it is like asking "can people levitate?". It is a question of what is clearly visible to the naked eye, and not in the sense of a streak in the dark but in the sense of something you can observe repeatedly, like the sun.

Sasha Gusev's avatar

You don't need large-scale molecular data or fancy math to see that twin estimates are all over the place. Just see the twin study of IQ by eminent behavior geneticists that estimated a heritability of 20% in Danish twins and 98% in Swedish twins (https://pmc.ncbi.nlm.nih.gov/articles/PMC4795559/). Or one of the most prestigious twin studies of BMI that estimated a heritability of 30% using one set of twins and 110% using another set *in the same cohort* (https://www.nejm.org/doi/full/10.1056/NEJM199005243222102). Or one of the most prestigious twin studies of cancer that found melanoma is 2x more heritable than breast cancer (https://pmc.ncbi.nlm.nih.gov/articles/PMC5498110/) -- a finding completely at odds with everything we know about these malignancies. Or the fact that they find absurdly high heritability estimates for traits we *know* to be environmentally mediated, like peanut allergy, or stochastic, like the genre of literature you check out from the library. If you actually read these papers, like the SATSA twins-raised apart study above, the authors are pretty open about the fact that these models are often wrong and produce very crude results. It's only when heritability became politically useful that these issues got swept under the rug.
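
For readers wondering how a heritability estimate can exceed 100% at all, here is a minimal sketch using Falconer's formula, one common way of turning twin correlations into a heritability estimate. It is not necessarily the model used in the papers linked above, and the correlations below are hypothetical values chosen only to illustrate the arithmetic.

```python
def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Falconer's estimate: twice the gap between MZ and DZ twin correlations."""
    return 2.0 * (r_mz - r_dz)


# Hypothetical twin correlations (assumptions, not taken from any cited study).
plausible = falconer_h2(r_mz=0.70, r_dz=0.45)
out_of_range = falconer_h2(r_mz=0.70, r_dz=0.15)

print(f"MZ=0.70, DZ=0.45 -> h2 = {plausible:.2f}")
print(f"MZ=0.70, DZ=0.15 -> h2 = {out_of_range:.2f}")
```

Nothing in the formula bounds the result between 0 and 1, so a DZ correlation well below half the MZ correlation produces an estimate above 1.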

Peter's avatar

Thanks. But doesn't garbage research like this raise more fundamental questions about methods and measurement?

Also, I can't help but think measuring IQ or BMI is only marginally less problematic than measuring coastlines.

It seems like very old twin studies that involved the interviewing of cases by the investigators had a Bayesian advantage. It is interesting that the troubles seem to begin when samples are drawn from existing databases.

In real life, these findings are absurd. Obviously there isn't some secret difference between Danes and Swedes. So if the absurdity of these findings is visible to the naked eye, then isn't it possible that we're seeing something similar with GWAS?

Patrick Rich's avatar

It seems to me like holding onto your priors in the face of potential contradictory evidence is the most decidedly non-Bayesian approach I could imagine. If you establish a prior that only accepts evidence that is consistent with your prior, and only critiques evidence that is inconsistent with your prior, you are not taking a Bayesian approach. Or at least, you have established a prior that is so heavily weighted that whatever comes after is pointless.

"It is a question of what is clearly visible to the naked eye." There are many things that are clearly visible to the naked eye which provide only descriptive accounts of phenomena and reflect no true understanding of what we know. More sophisticated observation reveals the faults. Sugar, for example, is no more likely to cause hyperactivity than a non-sugar sweet placebo, but knowing a child has consumed sugar biases how parents perceive their child's behavior. Yet, to the parent, it is clearly visible that the sugar causes hyperactivity.

Peter's avatar

Good point, I'm going at a bit of a gish gallop here but I do understand those things. You pass over the sugar point as if it's some great example but I think it misses something deeper. The idea that it's parental bias in a pure sense is kind of shallow. It doesn't really account for the phenomenon. A better explanation might be that neither food causes true hyperactivity in the clinical sense, instead the sugar simply causes energy levels to return to a higher baseline causing the apparent boost in activity. There is bias at play, and environment, being handed a sugary snack is cause for jubilation. It is visible to the naked eye, that when handed a treat children will, on average, become more active. You're right to point out that this doesn't mean sugar causes hyperactivity.

A good explanation must provide a satisfactory account of all the evidence. When different lines of evidence converge, a good degree of confidence can be had. If the evidence is at odds we should be very careful about which evidence we blame for the disharmony.

My point about Bayesian reasoning was poorly explained. I'm saying that in much older studies when twins were examined one by one we got very high rates of concordance. This method is Bayesian as the interviewer had the advantage of being able to compare both twins directly, thus greatly reducing the chances of misdiagnosis. I'm arguing that this is a strength and not a weakness.

Patrick Rich's avatar

I completely agree that the sugar example is imperfect. It was a quick example just to reflect the importance of caution regarding what types of "obvious" data we trust. As you noted, the more likely explanation is that sugar is a reinforcer/reward, and those tend to increase positive affect, which can be interpreted as hyperactivity. But there are other examples in other areas of science where things that seem obvious even through well-conducted research end up being misunderstandings of the complexity of biological systems. My point was that when we look to attribute the *cause* of a particular trait/condition/behavior we have to accept that the more we explore and understand the complexity of human systems, the more nuanced "cause" becomes.

As you said, a good explanation must provide a satisfactory account of all the evidence and we have to be careful about which we dismiss. As always that's a two-way street. I don't even know that critics of twin studies would disagree that we have to be cautious about which we dismiss. In my readings of the debates, I find the critics of twin studies are also often the ones *most likely to critique both types of research*.

Ken Kovar's avatar

That’s what I hate about a lot of people who cite this kind of research. The quality is statistically bad and even lay people can get that but when someone is making a political point, suddenly the studies are solid science on a par with physics 😎

Charlatan's avatar

What about the high concordance rate among twins reared apart?

Peter's avatar

Look, I'm galloping a bit. Those are all good points. I guess I'm just not sure equal environment mattered much in the first place. I suppose you could call that a hunch or a belief and I know it's not popular.

Gerry P's avatar

I'm not qualified to have an opinion or ask a meaningful question here. Just wanted to say thank you.

Joe Ross's avatar

Given the history of the subject, I understand why it was originally the "missing heritability" problem. Those of us who work in clonal multicellular organisms (e.g. selfing plants, C. elegans and the like) were already pretty sure it was a "missing environment" problem, because there's plenty of evidence that genetically identical individuals (aside from de novo mutations) raised in the same controlled environment have strikingly diverse traits (e.g. lifespan, fecundity). I concur that MZ twin studies assume shared environment when there are certainly differences, and we may have reached the limit to which we can control environments (e.g. two genetically identical C. elegans living on the same Petri dish with access to the same amount of the same food source in a climate-controlled incubator). Is it appropriate to extrapolate these findings to the case of humans and very clearly realize that of course there are going to be some missing environment estimates that we can't observe (or potentially even imagine what they might be?) In research, I've recently been forced to wonder more about these questions - so thanks for the very timely post! - and since I'm not a quantitative geneticist, I wonder: I assume that stochastic events are already incorporated as "environmental" in heritability studies because they're not genetic? For example, even MZ twins presumably have minor but potentially important random events that happen during early development: like how cytoplasmic resources are randomly partitioned at the first embryonic cell division. One twin gets slightly more mitochondria than the other, for example. The twin embryos are still genetically identical, but started life with a potentially major difference that can't readily be observed and can only be controlled for with great difficulty (in model organisms). Are such effects categorized as environmental in the literature, or could random events be part of the remaining missing environment?

Sasha Gusev's avatar

Thank you for the comment. Most of these methods treat "the environment" as everything else that is not genetic, so the environment will include completely stochastic behavior as well. I think a big open question is to what extent the "missing environments" I alluded to are structured/shared versus stochastic/random.

JP's avatar

Would you hazard an educated guess?

F Gregory Wulczyn's avatar

"there's plenty of evidence that genetically identical individuals (aside from de novo mutations) raised in the same controlled environment have strikingly diverse traits (e.g. lifespan, fecundity)."

Spontaneously, that seems to suggest that for those traits the Gene-Phenotype relationship is not a monorail, ie. a given genotype has a range of phenotypic variability baked in (with the outcome subject to environmental influence). Are these observations true for all traits or only some? Are these individual differences heritable?

beowulf888's avatar

> ...even MZ twins presumably have minor but potentially important random events that happen during early development: like how cytoplasmic resources are randomly partitioned at the first embryonic cell division.

Is that a hypothetical, or has unequal division of resources actually been observed during cell division in eukaryotes? If so, can you provide a link to a good study? Thanks!

Joe Ross's avatar

Asymmetric partitioning of mitochondria (as an example) has been observed in various organisms for quite some time. The example I used extrapolates from studies like this one https://www.nature.com/articles/s41467-025-62484-5 (thanks for asking, because I hadn't seen this paper yet, and I might not have for a while more without your prompt!) This particular study is about asymmetric cell divisions, though, which one could argue might not apply to some of the initial divisions of an embryo. The prior literature on mitochondrial segregation during cell division does routinely find that the number of mitochondria that are partitioned into daughter cells is not equal.

john's avatar

As somebody who has spent their career estimating heritability and genetic correlations (using recorded or estimated parentage or genomics) and then applying the results to breed more productive animals, several comments come to mind. The first is the passion of the human debate, with those holding opposed views cherry-picking the result that best suits them. The second is the contrast between the methods used in humans and those used in animals (yes, with different Ne and environmental conditions). There is no mention that maternal effects may also be heritable, and that the genetic correlation between direct and maternal effects is estimated with uncertainty. The third is the approach to, and consequences of, estimating the effect of GxE on the results. In some cases, GxE will inflate the heritability estimate if not accounted for correctly, and in other cases, it will reduce the estimate. That said, there is a considerable group of people who find any estimate of heritability above zero to be incompatible with their worldview. To them 0.3 is close enough to zero. By the way, the joke amongst animal geneticists is that the heritability of almost any easily recorded trait will be between 0.2 and 0.4.

SM's avatar

I’m not smart enough to understand this all on first pass, but I’ll get there eventually. I’m quite curious where we will land for major mental illness (bipolar, psychosis), as clinically it is quite evident there is a sizeable risk in probands of affected family members. But perhaps there is more confounding than the twin study estimates suggest.

the kid's avatar

Hey, good post - thanks for sharing. The analysis and discussion of BMI was particularly nice.

Did have a point of disagreement though: although inflation of twin study estimates has been attributed by some to equal environment assumption (EEA) violation, such a view is in fact not supported by empirical evidence (for example, most notably, by comparing estimates obtained from classic twin vs twins reared apart studies) - I'd say that EEA violation is likely to only play a weak/minor role for the vast majority of, if not all, traits. Would instead argue that the inflated twin study estimates are more likely to be explained by unmodeled gene-environment correlations and interactions (as described in Falconer's classic work for example).

Sasha Gusev's avatar

But recall that gene-gene and gene-shared-environment interactions are also included in the sib-reg estimate.

Simon Kinahan's avatar

Isn't it still possible that there are non-linear interactions between SNPs that SR will not capture because they don't have a linear-looking component?

Sasha Gusev's avatar

SR will capture pairwise interactions entirely and a large chunk of higher order interactions (see here: https://unboxingpolitics.substack.com/p/contra-scott-alexander-on-missing#footnote-anchor-19-167620808 or the Young et al. RDR paper). The interactions need to be explicitly higher-order to be missed by SR, and in that case they will actually inflate the ACE/twin estimate.

Simon Kinahan's avatar

Thank you! That makes sense.

the kid's avatar

Agreed Sasha, (in fact my own preference is to refer to such sib-reg estimates as approximating broad sense heritability), but what about GxExG interactions such as gene x unique environment x gene interactions? I think these might be genuine possibilities worth considering.

Alexander MacInnis's avatar

Thank you, Sasha. This is very interesting and points me towards some things I would like to study further. I'm an epidemiologist. My training did not include the details you discuss here.

I would like to ask you a favor. It might benefit many others, too. Please explain clearly and concisely whether and how RDR and SR avoid interaction between genetic susceptibility and environment. I am referring to genetic profiles that putatively cause a phenotype that may actually require some environmental factor in order to produce the phenotype. That is, each is a component cause and a full set of component causes produces a complete cause, which produces the outcome. The low penetrance of common variants implies that such GxE interactions might have significant explanatory power.

Thank you in advance.

Sasha Gusev's avatar

Thanks. Just to be clear, every method here will include components of the environment that are directed by the genetics. In the classic example where kids with red hair are discriminated against and kept out of school, all of these methods will estimate that schooling is highly heritable. With respect to environmental interactions, RDR avoids including them in its variance estimate because it primarily compares completely unrelated individuals that are unlikely to share close environments. Whereas SR primarily compares siblings and thus *will* include the influence of interactions with the environment that is shared between siblings. All methods will also include interactions with the "national" environment since most of these estimates are meta-analyses across national cohorts.

Alexander MacInnis's avatar

So, clearly heritability estimates from studies comparing MZ to DZ twins do not show limits on the effects of environment. Such studies implicitly assume that there is no gene-environment interaction. Not only is that assumption almost never stated and never verified, the low penetrance inherent in common variants shows that the assumption of no gene-environment interaction is false.

RDR and SR can help in some cases. But environmental exposures that change over time and that are fairly consistent with the population studied can easily cause GxE interaction that is not detected.

the kid's avatar

Thank you Sasha (and Alexander for initiating the very useful/interesting query).

Sasha, will SR estimates (even those from full sib studies) include all putative gene-environment interaction effects or is it possible that some potentially significant proportion will go uncaptured?

VerumSerum's avatar

The decode group also interestingly showed with deep sequencing on MZ that there ARE a decent amount of sequence differences. But as mentioned these may contribute very little. One part of this that always bothers me goes back to Fraga 2005 and the epigenetic component. Some of our family based epigenetic work clearly showed that an enormous fraction of “heritability” of methylation is really under genetic control. BMI seems tricky but it’s always used. I wish we focused on more tangible biological functions like T cell activation with more focus on MHC, i.e. GxE on the immune system. Anyway, great article! Will read it again. GREAT reference on this topic!

Sasha Gusev's avatar

Thanks! You might have seen it but the recent decode study of methylation/expression has led me to believe that almost all of the gene relevant epigenetic activity is actually under genetic control (https://www.nature.com/articles/s41588-024-01851-2).

VerumSerum's avatar

Yep reminds me of some work by Andy Feinberg a few years back looking at “GeMes”

JP's avatar

Thank you for the great, and very readable, post. My only real comment is that I especially appreciate your interaction with so many of the commenters, even those who might come off as, let’s say, “unconvinced,” or who clearly didn’t read carefully. That you continue the interaction through multiple reply “rounds” is even more impressive.

Your patient and clear responses absolutely help those of us who arrive later & who may have similar questions. It’s like having a Q & A period after a talk that helps clarify for many people, and to then have that session immortalized for those unable to attend. Cheers!

Sasha Gusev's avatar

Thanks, much appreciated. I swear I'm working on being less grumpy.

JP's avatar

Given the context here… good luck Sisyphus ;) I think we would all understand any lapses

Ben's avatar

Really well-explained article, Sasha! My eyes sometimes glaze over when I’m obligated to read biology but not so in this case 😆

I feel like I’ve tried to avoid reading too much from the ppl that might be most inclined to offer rebuttals to this, but have most of them essentially come round to this viewpoint or are they holding out?

Sasha Gusev's avatar

Thanks! Surprisingly I haven't seen much of a response. I think people generally agree that twin estimates are inflated, though there is still much to be learned about individual traits. On the more unseemly parts of the internet (race twitter, etc) I've mostly just seen the argument that these results are unintuitive and inconvenient, therefore we should not speak about them.

JP's avatar
Dec 4 (edited)

Interestingly, Pinker responded but specifically referred to GWAS deflation rather than twin-study inflation 🤔

https://x.com/sapinker/status/1996282777650839924

Sasha Gusev's avatar

I thought this was pretty funny since none of the studies discussed use GWAS data :) I guess people see what they want to see.

JP's avatar

Minor detail lol

fox's avatar

Please let me know if I’m making a mistake here, but the reliability of the UK Biobank Fluid Intelligence test is only ~0.65. Once you account for this using a Spearman correction for attenuation, the estimates become h²_true ≈ 0.33/0.65 = 0.51 and h²_true ≈ 0.40/0.65 = 0.62. This adjustment is the standard way to account for measurement noise in the test, and it brings the estimates much more closely in line with the original twin studies.
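
The arithmetic in this comment can be reproduced in a few lines. This is a minimal sketch of the classical correction for attenuation applied to a heritability estimate, assuming measurement error is pure noise that is uncorrelated with genetics. The function name `disattenuate` is illustrative, and the inputs are the values quoted in the comment.

```python
def disattenuate(h2_observed: float, reliability: float) -> float:
    """Divide an observed heritability by test reliability.

    Assumes measurement error is uncorrelated with genetics, so that
    observed h2 = true h2 * reliability.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return h2_observed / reliability


reliability_fi = 0.65  # reliability quoted in the comment

for h2_obs in (0.33, 0.40):
    corrected = disattenuate(h2_obs, reliability_fi)
    print(f"observed {h2_obs:.2f} -> corrected {corrected:.2f}")
```

Under the same assumption, a trait with a true heritability of 1 would show an observed heritability equal to the reliability, here 0.65.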

Sasha Gusev's avatar

As noted in the post, I don't think the estimates for IQ/EA from Wainschtein et al. are particularly useful because of the unknown levels of over-/under- correction for stratification. I would be very surprised if IQ/EA are the two traits that happen to show no drop in heritability with proper within-family molecular methods, but this remains to be seen. I'll note that Williams et al. (https://pubmed.ncbi.nlm.nih.gov/36378351/) worked very hard to estimate a high quality general factor of intelligence in the UKB and the heritability did not budge at all, so I'm skeptical of using these simplistic corrections.

fox's avatar
Nov 24 (edited)

All of what you’re saying may be true. I don’t have a strong opinion about their estimates in and of themselves nor is this about g. My point is about how these particular results should be interpreted. That correction isn’t “simplistic” so much as mathematically necessary to account for measurement-error variance in the FI test. For example, even if FI had a true heritability of 1, the observed estimate would be equal to the reliability of the test.

Sasha Gusev's avatar

I don't think we should be interpreting these particular results at all, for reasons I explained in the post. The correction is simplistic in its assumptions: that the latent variable being measured is unidimensional and that the error on the latent variable is uncorrelated with genetics. Neither assumption has been evaluated and the fact that the latent "g" inferred by Williams et al had the same heritability as the observed IQ suggests the assumptions do not hold.

fox's avatar
Nov 24 (edited)

It’s fair to say we shouldn’t interpret the results at all, but this is just a basic classical test theory identity. While it doesn’t depend on unidimensionality, it could in principle be affected by genetically correlated error, but that would be pretty strange, and as far as I can tell the psychometric consensus is that UKB FI’s low reliability is mostly item-level/short test noise. Multidimensionality plays a role in construct validity, which might affect what you’re saying about g.

Sasha Gusev's avatar

I'm not really sure what the dispute is. This basic classical test theory identity requires satisfying a specific set of basic classical assumptions, the assumptions were not satisfied (in fact they were not tested) so applying the correction is not appropriate.

Separately, while the FI test has a reliability of ~0.6, the general factor has a reliability of ~0.8. Yet the heritability of the two was almost exactly the same in Williams et al. (in fact, the FI is a tiny bit *more* heritable). This is good indication that the "pretty strange" genetically correlated error is in fact at play and the correction just induces bias. That's why it is important to evaluate assumptions!
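
The logic of this check can be written out as arithmetic. This is a minimal sketch, assuming the classical model in which observed heritability equals true heritability times reliability, and assuming (a simplification not stated in the thread) that FI and the general factor index the same latent trait. The `h2_true` values are hypothetical placeholders, not estimates from Williams et al.

```python
def expected_observed_h2(h2_true: float, reliability: float) -> float:
    """Observed heritability predicted by the classical model (noise-only error)."""
    return h2_true * reliability


fi_reliability = 0.6  # approximate value quoted above
g_reliability = 0.8   # approximate value quoted above

for h2_true in (0.3, 0.5, 0.7):  # hypothetical true heritabilities
    fi = expected_observed_h2(h2_true, fi_reliability)
    g = expected_observed_h2(h2_true, g_reliability)
    print(f"true h2 = {h2_true:.1f}: predicted g/FI ratio = {g / fi:.2f}")
```

Whatever the true heritability, the classical model predicts that the more reliable measure should look about a third more heritable. Observing roughly equal heritabilities instead is what suggests the noise-only error assumption does not hold.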

Ebenezer's avatar

Emil Kirkegaard seems to have made a similar point: https://www.emilkirkegaard.com/p/what-did-the-new-wgs-ukbb-study-show

Funny how two people who disagree can each read the exact same paper, and each conclude that it confirms their worldview...

Jack's avatar

In an earlier comment, you mention that heritability estimates are deflated for in-family methods like RDR when there is assortative mating? If so, wouldn't the heritability estimates produced by these methods also be deflated for traits like educational attainment, IQ, and to a lesser extent, BMI? Similarly, wouldn't we expect twin study estimates to be inflated due to assortative mating? Are you able to quantify how much each of these traits is being inflated/deflated?

Michael A Alexander's avatar

So, can I conclude that for a trait like IQ, for which twin studies showed ca. 60% heritable factors and genetic studies show ca. 30% genetic factors, there is also a 30% heritable, but not genetic, factor? As an adherent of dual inheritance theory, I would put this heritable, but not genetic, group into a cultural category. That is, the environment that people experience in our modern culturally constructed world is mediated through the heritable cultural identity/nature they carry with them.

Michael Coleman, Ph.D.'s avatar

According to the RDR paper cited herein (Young et al), the estimated heredity for the height phenotype in the Iceland population sampled is approximately 0.8 from Twin studies and 0.55 from RDR. Intuitively, one would expect a well fed population to have height heredity approaching 100%. The twin study value is very likely closer to reality and thus brings into question many of the remaining claims in the piece.

Sasha Gusev's avatar

The height result is an interesting example where intuitions can be misleading. Height is well known to be under assortative mating, which induces a downward bias in all within-family estimators (twins, sibling-regression, and RDR). If you take the typical twin estimate of 0.8 and correct it for moderate amounts of assortative mating seen in biobanks (mate correlation of 0.27 - https://pmc.ncbi.nlm.nih.gov/articles/PMC10967253/) you get an estimated equilibrium heritability = 1.17 which is obviously implausible. If you do the same to the RDR estimate of 0.55, you get an estimated equilibrium heritability = 0.67 which is nearly spot on the estimate of heritability = 0.69 from Icelandic pedigree data (https://pmc.ncbi.nlm.nih.gov/articles/PMC3667752/) [which, in contrast to within-family methods, is slightly inflated by assortative mating]. In short, after accounting for assortative mating, the RDR and pedigree estimates line up while the twin estimates are completely implausible. Most traits are not under strong assortment so this will not be a wide-spread issue, but it highlights why actually doing the math is important over trusting our gut.
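
The formula behind these numbers is not spelled out in the comment. One relation that reproduces both quoted values, offered here as an assumption rather than as the exact derivation used, is that a within-family estimate equals the equilibrium heritability shrunk by the factor (1 - r × h2_eq), where r is the mate correlation. A minimal sketch solving that relation for h2_eq:

```python
import math


def equilibrium_h2(h2_within: float, mate_corr: float) -> float:
    """Solve h2_within = h2_eq * (1 - mate_corr * h2_eq) for h2_eq.

    Assumed relation (not stated in the comment). The smaller root of the
    quadratic is returned, since it reduces to h2_within as mate_corr -> 0.
    """
    if mate_corr == 0.0:
        return h2_within
    discriminant = 1.0 - 4.0 * mate_corr * h2_within
    if discriminant < 0.0:
        raise ValueError("no real solution for these inputs")
    return (1.0 - math.sqrt(discriminant)) / (2.0 * mate_corr)


mate_corr = 0.27  # mate correlation quoted in the comment

for label, h2_within in (("twin", 0.80), ("RDR", 0.55)):
    h2_eq = equilibrium_h2(h2_within, mate_corr)
    print(f"{label}: within-family {h2_within:.2f} -> equilibrium {h2_eq:.2f}")
```

With the quoted inputs this gives about 1.17 for the twin estimate and about 0.67 for the RDR estimate, matching the figures in the comment.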

Michael Coleman, Ph.D.'s avatar

RDR is supposed to be free of the need for such a correction. Twin studies comparing monozygotic and dizygotic twins have the parents as a controlled (constant) variable - thus your correction is not appropriate in that case.

Interestingly, Wainschtein et al, which you reference and which I just started reading, claims that WGS with a much wider search, comes much closer to the pedigree based heredity values on a range of phenotypes. They report height heredity of 0.71 based on WGS, supporting my intuition and my assertion that the Twin estimate is likely more accurate than the RDR value.

Sasha Gusev's avatar

No, this is all incorrect. RDR, sibling regression, and twin studies are all deflated by assortative mating because assortative mating increases the population level genetic variance but *not* the within-family genetic variance. See Young (https://www.biorxiv.org/content/10.1101/2023.07.10.548458v1), Kemper (https://www.nature.com/articles/s41467-021-21283-4), or my own previous derivations (https://theinfinitesimal.substack.com/p/some-notes-on-assortative-mating).

To your second point. Wainschtein et al looked at data from the UK, while RDR (and the pedigree data I cited) looked at data from Iceland. The UK and Iceland are different populations and since heritability is a population-level parameter we have no reason to think these two populations should produce the same heritability for any given trait.