Book Review: Eric Turkheimer's "Understanding the Nature-Nurture Debate"
Or, some thoughts on The Gloomy Prospect and the Gloomy Present
You have probably heard of the three laws of behavioral genetics:
All behavioral traits are heritable.
The effect of being raised in the same family is smaller than the effect of genes.
A substantial portion of the variation in complex human behavioral traits is not accounted for by the effects of genes or families.
Proposed by Eric Turkheimer in 2000 after decades of research, they are at once profound and largely misunderstood insights into human behavior. The laws map to the twin “ACE” model parameters from which they are largely derived: the first law is about “A”, the additive genetic component; the second law is about “C”, the shared environment; the third law is about “E”, the non-shared environment.
Now Turkheimer has written a book. Ostensibly a primer on “The Nature-Nurture Debate”, it is more a treatise on the century-long effort of behavioral geneticists to explain why humans act the way we do and the fundamental inability of observational data to provide easy answers.
Counting telephone poles and collecting stamps
The book is anchored by two main themes. The first theme focuses on research that primarily aims to measure and catalog various correlations. Turkheimer alternately describes these studies as “stamp collecting”, “counting telephone poles”, etc. Collect a large number of measurements from some people, estimate the correlations, report the significant ones (if you didn’t find any significant ones, collect more data until you do), and publish. We know the kind of studies he is talking about. Turkheimer then charts a path starting from Galton’s early work measuring, quantifying, and ranking various behaviors, people, professions, and ultimately “races”; through modern quantitative genetics cataloging twin correlations and variance components; to contemporary molecular genetics and GWAS cataloging thousands of individually associated variants. This path is not an equivalence: Turkheimer distinguishes Galton’s impressive “stamp collecting” from the repulsive racism which it ultimately served. But there is a shared method in the exhaustive quantification of correlations: they generate all sorts of tantalizing stories, but the actual inference of causes is typically left for someone else to fill in.
The second theme is the view that humans are both like and unlike animals in fundamental ways: “The genetic (as in Genesis) paradox at the heart of the human condition is now coming full circle. We are animals, full stop. Yet we are not.” (Turkheimer). Like, in that all living things are driven by fundamental biological processes, are born, mate, reproduce, and die. Unlike, in that humans cannot be studied through manipulation: fixing or modulating environments, controlled and cross-breeding, these are all untenable and morally abhorrent. The fields of behavioral genetics, and the social sciences more broadly, are thus forced to devise a series of “workarounds” to draw causal conclusions in humans without the ability to manipulate them:
One’s opinions about social science depend fundamentally on attitudes about these workarounds. Are they, in the Galtonian spirit, clever improvisations that allow human traits to be brought to the edge of the bright light of natural science? Or are they (in the dark side of the Galtonian spirit) hapless kludges that can then be used to justify thinking about humans as though they were rats or cattle?
In my view, what connects these two themes — stamp collecting and the inability to manipulate — is the commonly held premise that by collecting enough stamps one can understand the causal processes driving complex outcomes without the need for manipulation. That if we simply do a large amount of the former, we can answer the unanswerable questions presented by the latter. This view is not isolated to the social sciences. You see it everywhere, with each new technology — “causal inference”, “Artificial Intelligence”, “Big Data” — promising to obviate the need for manipulation. Recently, this view is perhaps best personified not by the social scientist (social science has become much more aware of the importance and limitations of causal inference) but by the Silicon Valley investor:
But mother nature has blessed us with something that feels very much like a manipulation. One weird trick. Or, more formally, a “natural experiment” in the (nearly) random occurrence of genetically identical and half-identical twins. We do not even need to measure the genetic similarity since we can see it with our own eyes.
The one weird trick: twins
Turkheimer, a clinical psychologist and also a twin researcher, reviews the history of twin studies of human behavior. He derives the Falconer equations, which allow one to estimate a “heritability” parameter from just two correlations, and explains how they work. He provides pithy biographical sketches of the luminaries of the field, Cyril Burt (disgraced), Hans Eysenck (disgraced), and eventually Robert Plomin (prolific), and of how they shaped it. The twin model was a way for behavioral genetics to shake its eugenic past and move towards a clean, analytical future. And this transition did lead to important findings. First, that two traits co-occurring in a family does not mean one causes the other: depressed children having depressed parents does not - in and of itself - mean parents cause depression; smart kids having more books in the home does not - in and of itself - mean having more books makes you smart, etc. Second, it helped dispel the myth that psychiatric conditions are solely the consequences of bad upbringing or bad behavior. One could imagine an alternative universe where the field takes twin studies “seriously but not literally” and follows a middle path: behavior is influenced by genetics to some extent, so observational studies need to incorporate genetically informed designs; but twin models are also influenced by environmental assumptions to some extent, so their estimates shouldn’t be taken at face value either. Instead, the idea of having a seemingly fool-proof natural experiment proved too attractive for caution. Twin heritability was enshrined as a fundamental biological parameter of deep value. A thousand twin studies bloomed, quantifying the heritability of every phenotype, behavior, or measurement one could think of. And at a certain point, the stamp collecting became an end unto itself:
If one can identify a research paradigm with a nearly guaranteed outcome that at the same time seems to confirm an important scientific theory and score points in an ancient philosophical debate, it’s a gravy train. That’s what twin studies had become. Everyone agreed that the question of “how genetic” behavioral differences are was an important scientific question, and that twin studies were a useful way to answer it. Best of all, twin studies always worked! There was no chance that after all that effort, identical twins would turn out to be uncorrelated on the questionnaire, or that fraternal twins would somehow be more similar than the identical ones. None of the promises about investigating the actual biological basis of behavioral differences were ever fulfilled, but we could worry about that later.
Twin studies are certainly not the only scientific gravy train, and Turkheimer has highlighted what may be a universal factor: a method that ostensibly answers an important scientific question but is, in fact, always guaranteed to come up in the investigator’s favor. Even better if running the method is resource constrained or otherwise expensive, so that the senior researchers who have “paid their dues” can be at the helm. And if you think you are hearing the echoes of some modern gravy trains — the countless single cell atlases, high-throughput screens, Mendelian Randomizations of dried fruit intake on time spent outdoors, etc. — well, yes, I hear those echoes too.
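For readers who want to see the arithmetic behind the Falconer equations mentioned above, here is a minimal sketch. It assumes the textbook form of the estimator (identical twins share all additive genetic variance, fraternal twins share half, and both kinds of pairs share the family environment equally). The function name and the example correlations are made up for illustration and are not taken from the book.

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Textbook Falconer estimates from MZ and DZ twin correlations.

    Assumes additive genetic effects only, equal shared environments for
    MZ and DZ pairs, and no assortative mating.
    """
    a2 = 2 * (r_mz - r_dz)  # "A": additive genetic share (twin heritability)
    c2 = 2 * r_dz - r_mz    # "C": shared environment share
    e2 = 1 - r_mz           # "E": non-shared environment share (plus error)
    return {"A": a2, "C": c2, "E": e2}


# Hypothetical correlations, chosen only to illustrate the arithmetic.
estimates = falconer_ace(r_mz=0.60, r_dz=0.35)
print(estimates)
```

Note that the three shares always sum to one by construction, whatever two correlations go in. That is part of why, as the quote above puts it, twin studies "always worked": any pair of correlations with the identical twins more similar than the fraternal ones yields a positive heritability.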
So the pile of ACE estimates grew, with A being immeasurable and C typically lower than E. Eventually, the field was forced to move beyond quantification and actually try to make sense of E, the “non-shared” environment:
Plomin and Daniels proposed that figuring out the specifics of differential environmental effects on siblings should define the social scientific agenda in the nineties, and they succeeded in creating a paradigm. Hundreds of studies were conducted, in which social scientists measured the differential experiences of siblings and used those differences to predict differences in behavioral outcomes. Then a funny thing happened: there was nothing there.
The relevant paper — Turkheimer and Waldron (2000), a prelude to The Three Laws published in the same year — is remarkable both in its breadth and force. It is a meta-analysis of 43 studies investigating the influence on sibling outcomes from a gamut of environmental differences: differential parenting, differential peer groups, differential sibling interactions, differential teacher relationships, within-family factors, and interactions of all of the above. The raw meta-analysis finds a very weak effect of these differential experiences on outcomes (explaining <5% of their variance). When further restricting to studies that employed genetic controls, that estimated effect is essentially reduced to nothing.
Importantly (though Turkheimer does not go into this), it is not the case that the environmental factors are simply stochastic. Studies that look at families where one sibling was adopted away and one remained have found that IQ increased substantially in the adopted sibling (4.4 pts on average), with a greater increase in adoptive families with higher education (7.6 pts on average), and with similar findings for other value-laden traits like criminal behavior. Adoption is, of course, a sudden manipulation of the entire home environment [1], making it impossible to disentangle which specific component was the underlying cause. But it provides some quasi-experimental evidence that environmental changes can be a general influence on behavior, not just completely random fluctuations. So why can’t the specific influences be identified?
Turkheimer’s answer is The Gloomy Prospect: a kaleidoscope of idiosyncratic gene-environment interactions and correlations that unfolds over the course of development as individuals encounter, select, match into, and reshape their environments and the environments of those around them. Why does the shared environment matter so little (The Second Law)? As Turkheimer and Waldron point out, in the twin/ACE model “shared environment” is merely anything that makes siblings more correlated. Factors like divorce, inherited wealth, common schooling, etc. that one would typically think of as “shared” can still be assigned by a twin model to E (the “non-shared” environment) if they do not make siblings more similar. If siblings going through a parental divorce compete for the favor of their parents, leading one to benefit and the other to suffer — that divorce gets quantified as a “non-shared” environment. By the same token, a shared experience will be assigned to “A” (genetics) if it interacts with the genetic variation: if genes behave differently in different family contexts, or match into special environments. The family environment that appears influential in the sibling-adopted-away studies can be shredded into E or A in twin studies, with any individual correlation or interaction too minuscule to identify. This is where the field of behavioral genetics had landed at the end of the 20th century: both genes and environment matter and we have no idea how.
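The point that a literally shared event can land in E is easy to check with a toy calculation. The sketch below is my own illustration, not a model from the book or from Turkheimer and Waldron. It assumes an extreme case in which a family-level event (say, a divorce) pushes the two siblings in exactly opposite directions, contributing variance `d` to each sibling and covariance `-d` between them. All names and numbers are hypothetical.

```python
def twin_correlations(a2, c2, e2, d):
    """Expected MZ and DZ correlations under a toy variance model.

    a2, c2, e2: variance from additive genes, shared environment, and
                non-shared environment.
    d: variance from a family-level event that affects both siblings but
       in exactly opposite directions (assumed, extreme case).
    """
    total = a2 + c2 + e2 + d
    r_mz = (a2 + c2 - d) / total        # MZ pairs share all of A
    r_dz = (0.5 * a2 + c2 - d) / total  # DZ pairs share half of A
    return r_mz, r_dz


def falconer_ace(r_mz, r_dz):
    """Textbook Falconer estimates from the two twin correlations."""
    return {"A": 2 * (r_mz - r_dz), "C": 2 * r_dz - r_mz, "E": 1 - r_mz}


# Hypothetical truth: 40% genes, 20% conventional shared environment,
# 30% non-shared environment, 10% from the sibling-splitting family event.
estimated = falconer_ace(*twin_correlations(a2=0.4, c2=0.2, e2=0.3, d=0.1))
print(estimated)
```

Under these assumptions the estimated A is unchanged, but C is halved and E absorbs twice the event's variance, even though the event happened to the whole family. Real events will rarely be this cleanly antagonistic, so treat the numbers as a direction of bias rather than a magnitude.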
But mother nature was about to provide the field with another “natural experiment”: the ability to measure individual alleles in the human genome and to correlate them with phenotypes. And to do so across many humans and at scale. The final section of The Three Laws is titled “Anticipating the Genome Project”, and it included some stark predictions about the coming molecular era:
If the underlying causal structure of human development is highly complex … the relatively simple statistical procedures employed by developmental psychologists, geneticists, and environmentalists alike are being badly misapplied. But misapplied statistical procedures still produce what appear to be results. Small relations would still be found between predictors and outcomes, but the underlying complex causal processes would cause the apparent results to be small, and to change unpredictably from one experiment to the next. So individual investigators would obtain “results,” which would then fail to replicate and accumulate into a coherent theory because the simple statistical model did not fit the complex developmental process to which it was being applied. Much social science conducted in the shadow of the gloomy prospect has exactly this flavor (e.g., Meehl,1978).
That final citation of Meehl’s 1978 work itself begins by citing Popper’s 1959 book. I don’t know if this is what Turkheimer was intending, but I read it as saying: “this has been happening for decades and it is happening again”.
The second weird trick: molecular genetics
With behavioral genetics having exhausted the attempts to understand E (and long ago lost interest in C) molecular genetics breathed new life into the effort to characterize “A”. Turkheimer reviews the early period of molecular behavioral genetics using Plomin’s hunt for IQ genes as a scaffold. In paper after paper, a molecular study is run, new IQ genes are identified, compelling stories are constructed. Then in the next study those genes fail to replicate. Rinse and repeat for ~15 years. Through this period Plomin is largely undeterred, ending each study with a promise that with just a few more samples the real IQ genes will be found. Here was The Gloomy Prospect in the “candidate gene era”. Every genetic association seems to tell a compelling story, and yet none of the stories ever replicate. But then, a breakthrough comes in Genome-Wide Association Studies: forget testing specific candidate genes in hundreds of samples, collect hundreds of thousands (and eventually millions) of samples and test every single variant, producing millions of highly sensitive correlations. Now the field was cooking with gas, and the “hits” started rolling in. But not just one or a handful of IQ genes; hundreds of them; then thousands; most not even in genes; and each “explaining” barely anything.
I should say at this point that I am a card-carrying GWAS Guy. I think GWAS has expanded our understanding of the biology of many traits. In my lab, I run GWAS, develop methods for running GWAS, interpret GWAS, and arguably contribute to the stamp collecting effort that Turkheimer critiques. I put on my GWAS cologne in the morning and I sleep soundly in my GWAS sheets at night. Like any GWAS guy, I can rattle off a number of important genes that GWAS has discovered or re-discovered: PCSK9, LDLR, CACNA1C and C4A, TP53 and MYC. You wake me up in the middle of the night and I’ll tell you that “human genetic evidence doubles the success rate” of clinical trials, and cite the slew of papers that have demonstrated as much.
But Turkheimer sets a trap for GWAS Guys. He reviews the Educational Attainment (EA) studies: EA1 to EA4, each published in one of the most prestigious journals in our field. He describes the rapidly ballooning sample sizes, reaching several million with EA4, and the increasing number of “hits” that are discovered with each round. You can tell where he is going with this, and if you are a GWAS Guy like me you start to mentally prepare the usual defense: we’re learning mechanisms, we know the effect sizes are small, but these associations give us levers into biology. Then the trap is sprung:
Dear critics, do you happen to remember what those three significant SNPs from the first EA GWAS were? What useful genetic science have they led to? Although significant SNP “hits” of this kind are often referred to pretentiously as “genomic discoveries”, the revelation that SNP rs9320913 accounted for 0.02 percent of the variance in EA was not a discovery in the usual scientific sense. No one remembers it today, no science has been built on top of it, and it has no application in the real world.
“Not a discovery in the usual scientific sense” really slides the knife in. Indeed, the top hit from Rietveld et al. (I had to look it up) is in an intergenic region; the closest gene is POU3F2, >600kb away, which may do something in the brain. The next two hits are in the genes LRRN2 and AFF3. I promise you none of these are at the top of any list of GWAS success stories. Here was The Gloomy Prospect again.
The situation did not improve from there. In EA2, a rudimentary within-family sign test analysis was carried out to demonstrate that the associations are causal and free of confounding, and the results are ambiguous [2]. In EA3, a more detailed within-family sign test definitively shows that the effect sizes being estimated are inflated, with the bias — the so-called indirect effects — attributed to genetic influences on the rearing environment [3]. In EA4, a polygenic score analysis finally demonstrates that just a third of the predictive effect is actually acting directly/causally in families [4]. In Howe et al. (2022), this was further quantified with a heritability estimate, showing that the direct/causal contribution to educational attainment is just 4% (compared to ~12% with confounding); earlier this year, Tan et al. performed an impressive amount of methodological improvement to bump this heritability estimate up to … 7% (14% with confounding). Finally, Nivard et al. (2024) showed that the “indirect” effects are likely not a consequence of “genetic nurture” from parents, but rather some muddle of familial assortative mating and stratification. At each turn, there was The Gloomy Prospect again: a seemingly simple variance component was revealed to be a complicated, confounded mess of genes, environment, and social sorting.
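The arithmetic behind these shrinking numbers is simple enough to sketch. The snippet below uses only figures quoted in this post and its footnotes; the variable names are illustrative. It relies on the reasoning in the footnoted EA4 quote: because variance explained scales with the square of the effect size, a direct-to-population effect ratio of 0.556 implies that roughly a third of the score's R² is direct.

```python
# EA4 (Okbay et al. 2022): ratio of direct to population effect estimates.
direct_to_population_ratio = 0.556

# R^2 scales with the squared effect, so the direct share is the ratio squared.
direct_share_of_r2 = direct_to_population_ratio ** 2
print(f"EA4 direct share of PGI R^2: {direct_share_of_r2:.1%}")

# Heritability of educational attainment: (direct, with confounding).
# The Howe et al. population figure is approximate ("~12%") in the text.
heritability = {
    "Howe et al. (2022)": (0.04, 0.12),
    "Tan et al.": (0.07, 0.14),
}
for study, (direct, population) in heritability.items():
    print(f"{study}: direct / population = {direct / population:.0%}")
```

By either measure, somewhere between a third and a half of the signal survives the move to within-family estimates, which is the sense in which the population-level figures overstate the direct effect.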
Turkheimer’s prediction for the molecular era was largely correct. I’m not one of those people who shouts at the authors of GWAS papers that they only discover false positives — they clearly replicate. But, for many behavioral traits [5], they do so in a largely technical sense: the confounding in the UK is similar to the confounding in the US, so a false positive in one place will replicate in the other. Tweak the environment a little bit — for example, by restricting to individuals with high SES — and even the non-causal predictive accuracy can take a nose dive. Just as with twin studies, the correlations were cataloged and then the causes were left to worry about later. And here we are more than a decade later having learned nothing more about the biology of educational attainment other than that it has something to do with the brain, possibly, uh, neurons.
Everything that has happened since Jensen, Herrnstein’s syllogism, and The Bell Curve has underlined the complexity of the developmental space in which IQ exists. The Human Genome Project has arrived, and nothing resembling “genes for” human intelligence has been found. GWAS has turned the tables on the heritability of intelligence, from the 80 percent presumed by Jensen to something closer to 20 percent now; within-family analyses have reduced the heritability of intelligence even further. The well-understood genetic and neurological mechanisms of IQ differences envisioned by Murray remain a dystopic pipe dream. Polygenic scores for intelligence, especially when they are properly corrected for family-level differences by estimating them within sibling pairs, don’t work well enough to be useful to anyone, or to prompt anyone to think there are powerful deep-seated genetic causes of differences in cognitive ability. The Flynn Effect, in contrast, is both the most dramatic scientific finding about intelligence since the establishment of g, and almost certainly environmental in its causes.
The Gloomy Prospect also shows no sign of abating. In just the past few years it has confounded yet another quantity (so recent, it likely postdates the writing of this book). Allow me a brief digression. For pairs of traits, the similarity in their genetic influences can be summarized in a parameter known as the genetic correlation. Genetic correlations between GWAS traits were initially found to be widespread, implying that many traits may be highly biologically interrelated, driven by shared causes, and possibly even shaped by shared evolutionary factors. But pairs of traits, especially behavioral and psychiatric traits, also exist within the world of The Gloomy Prospect, where humans interact and sort based on their behaviors: people with anorexia marry people with anorexia, and they also marry people with depression; their offspring then co-inherit anorexia and depression genes that would otherwise be independent; these genes and parental environments in turn shape the offspring environment, and so on over generations.
It turns out these cross-trait spousal relationships are significant and widespread, as recently shown by Border et al. (2022). In fact, the spousal correlations are so large that they may explain a large fraction of the observed genetic correlations for many traits. Variants that have a direct effect on only one trait will falsely appear to influence both. Functional elements (e.g. expression in the brain) that are only enriched for the causal heritability of one trait will falsely become enriched for both. And all of these false correlations are actually due to culture, not genes. The Gloomy Prospect, multiplied (and if you think this stops at pairs of traits, see Border et al. (2024)).
Known unknowns
There is a temptation to think that genetic variation can be the weird trick that provides us with a causal manipulation in humans. This is, I suspect, why there is so much excitement about polygenic scores in the social sciences: a new workaround! At a technical level it is simply not true: the arrow goes both ways. Environment also causes genes through assortative mating and cultural transmission (i.e. makes genetic variation correlated with other processes it would otherwise be independent of). But, as Turkheimer argues convincingly, even if the arrow only went out from genes, when “genes” mean thousands of individually tiny effects that interact with millions of environments — the daily GxE interactions we call the human condition — then we effectively lose our weird trick. Look at the figure below: Suzy brought a hose to the bucket and Billy turned the tap on. What can we learn about how the bucket gets filled if we only measure — with great precision — how Billy turns on the tap, while averaging over thousands of different unmeasured Suzys? Or, in the case of classic twin models, simply assuming Suzy doesn’t exist.
Turkheimer often frames his model as a null hypothesis: non-zero heritability mediated by complex environmental interactions should be the default assumption, non-zero genetic correlation should be the default assumption, etc. A null hypothesis is a useful analogy because the null cannot be proven, it can only be tested against. It is a kind of “known unknown”. But this is also where the argument hits an inherent limit: how do you prove something like The Gloomy Prospect exists? Turkheimer is a pioneering figure in this field who has been studying these questions for decades. He thoughtfully recounts the sordid history, the post-war shift to credibility, and the modern data-driven rigor of GWAS. He draws out multiple effective analogies for why genetic variance component estimates are useless as indicators of behavioral mechanisms. He surveys decades of studies and null findings. Here is a respected expert telling you that there’s nothing there to be discovered with simple correlative methods; that The Gloomy Prospect cannot be ignored. But, at the same time, one cannot prove a null. When Turkheimer says “the problem turned out to be not that heritability was imaginary, but rather that it was misinterpreted as supporting a hereditarian model of human behavioral differences, when in fact it does not”, he cannot point the reader to a mathematical proof or a twin estimate that says “heritability: not imaginary but not hereditarian”. At a certain level you have to take him at his word. Don’t get me wrong, it is important for his word to be out there. And I suspect that other scholars will regularly need to take up the charge and explain, yet again, why heritability does not imply hereditarianism and all of that. But the inability to prove a null hypothesis — and the many many ways in which the alternative hypothesis and the millions of correlations can be misleading — is why this debate persists.
The Gloomy Present
So what is to be done? Turkheimer closes the book with a call for the field to focus on what he calls “essence heritability”: how much of the genetic effect on a trait actually functions through recognizable biological systems. Huntington’s Disease, an autosomal dominant disorder driven by a mutation that causes neurons to fall apart, has high essence heritability. IQ, a tangle of abstract reasoning test results that appears to be influenced by thousands of variants loosely enriched for “the brain”, would (as of today) have an essence heritability near zero. There are echoes here of Ned Block’s distinction between direct heritability (acting through clear biological processes) and indirect heritability (acting via environmental interactions). Both concepts are still underdeveloped — what does it really mean for a biological system to be “recognizable”? — but the core idea is that we should care about mechanisms and not just correlations; the latter being just one tool to get to the former. This emphasis on mechanisms may be one way molecular genetics can avoid becoming yet another Galtonian engine for correlations. The gravy train.
In my opinion (and I probably diverge from Turkheimer here), we can also get better at defining and estimating molecular quantities. We know that some traits exhibit largely direct effects while others are substantially indirect, implicating some kind of confounding. Some traits with large indirect effects additionally exhibit low genetic correlation between their direct and indirect effects, suggesting that the processes within families are fundamentally different from those between families. The heritability of some traits drops significantly after adjusting for geographic clustering, or when restricting to specific environments. These are still variance components — they do not tell us about essence — but they can place some constraints on the null hypotheses. In time, we could even imagine turning these into mechanistic parameters that get closer to essence: How much of the trait can be explained by direct, well-understood biological pathways versus diffuse interaction networks? How much by the interaction of genetic variation with recognizable and measurable environments? Rare variants, which appear to explain little in total in terms of classic heritability but operate through more recognizable mechanisms, may be of particular value in getting to essence heritability (though we should also be somewhat apprehensive about largely untested, shiny new toys). Just as important, mechanistic parameters can tell us which traits are outside the reach of our crude genetic instruments and fundamentally require manipulation.
Finally, there is another gloomy prospect that is yet to be addressed. Turkheimer naturally focuses on behavioral traits that are inherently interactive and can be easily conceptualized as Gloomy: intelligence, education, divorce, etc. But the big question — the $3 billion question — is the extent to which The Gloomy Prospect will be an insurmountable challenge for more conventional traits and disorders. Is depression a complex network of environmental inputs and genetic interactions? Is obesity? Is cancer?
Understanding the Nature-Nurture Debate is available from Cambridge University Press and Eric Turkheimer also writes at his Gloomy Prospect blog.
[1] This is only true in broad strokes. A major challenge with adoption studies is selective placement inducing correlated environments. And, indeed, in this study [Kendler et al. (2015)] the education of the biological and adoptive parents was also significantly correlated. Not to mention that families giving up or receiving children for adoption are, by definition, atypical.
[2] “The results from these analyses consistently suggest that unaccounted for stratification biases are unlikely to account for more than a modest share of the observed inflation in the λ_GC in the pooled EduYears analysis” ~ Okbay et al. (2016) - Supplementary Note 1.6
[3] “Although the evidence is not conclusive, it suggests that the GWAS effect-size estimates may be biased upward by correlation between educational attainment and a rearing environment conducive to educational attainment.” ~ Lee et al. (2018)
[4] “For predicting EA, the ratio of direct to population effect estimates is 0.556 (s.e. = 0.020), implying that 100% × 0.556² = 30.9% of the PGI’s R² is due to its direct effect” ~ Okbay et al. (2022).
[5] By which I mean any trait that undergoes cultural transmission and/or assortative mating.
Another fantastic post Sasha. I am always interested in what Turkheimer has to say, and having you present his work makes it twice as good.
Does he talk about FGWAS at all? In my opinion, it is a major methodological breakthrough that can control for a lot of confounding that plagues standard GWAS for behavioral traits.
His concept of "essence heritability" is an interesting one, and as a mechanism-oriented molecular systems biologist, I am sympathetic to it. However, to me "predictive heritability" is what really counts. If a variant truly influences a trait, then its presence or absence can be used to predict phenotype in a population. Pragmatically speaking, PGIs can complement existing clinical scoring systems.
Overall I am a lot less pessimistic than Turkheimer, and one reason for optimism is the "young guns" in the molecular human genetics field like you and Alex. I was always a bit dubious about some of the older human geneticists, and I won't name any names but you know who I am referring to, and so it is good to see the field moving in a more positive direction. There are a lot of exciting discoveries to be made.
Reading research and accounts on Mathematical Circles in the USSR convinced me of near-zero essence heritability for mathematical ability while at the same time being, in my opinion, a strong cause for optimism. At least for traits within the "education/intellectual skill-building" domain, this form of socialization seems to be a key ingredient. Of course, like most things it's likely significantly confounded and more complicated than "socialize kids properly, and they'll turn into Kolmogorov" (and along with being hard to quantify/test, it's probably not amenable to causally-informative research designs) but it can at least be somewhere to start, and may make us slightly less gloomy.