Beneath the surface of the sum
When genetic interactions matter and when they don't
Biology is full of interactions: genes regulate other genes, proteins form into complexes, cells exchange signals through receptors, tissues coordinate their function through hormones, and so on. And yet genetic scans for interactions — or epistasis, from the Greek for “stoppage” — hardly identify anything at all. To understand this seeming contradiction, it is important to distinguish between (a) biological epistasis, the actual structure of the causal effects driving a trait; and (b) statistical epistasis, the way correlations show up in different quantitative models. Let’s start with a simple model of biological epistasis and see how it manifests through heritability, natural selection, and intervention.
Biological epistasis mostly looks statistically additive
The simplest form of epistasis is one where the trait is only the product of effects at two different loci (A and B):
Here, one locus amplifies or dampens the influence of the other through some interactive process, aka biological epistasis. We can additionally think of statistical epistasis as the total genetic variance that is not captured by a simple additive model where y ~ A + B. If there is one overarching theme to this post, it is that biological and statistical epistasis are not the same. Let’s visualize how this simple biological interaction looks across different genetic configurations and how much of it is estimated to be additive:

To orient ourselves, we first simulate two loci with no epistasis at all. By definition, the effect of the first locus (A) does not depend on carrier status of the second locus (B) and so the genetic effects / slopes in the left-most panel are parallel. In the presence of epistasis, however, the slopes of the three B-locus carrier groups shift away from parallel: the effect in the BB carriers goes to zero while the effect in the bb carriers is greatly amplified. If we fix the total amount of epistatic variance, we can see that the deviations from parallel slopes depend on the allele frequency. For variants with a 5% allele frequency, just 17% of the epistasis is captured by the additive model, whereas for variants with 50% frequency the additive model captures 80% of the epistatic effect.
There are more complicated models, for example we can center the two loci such that the effect of their product in the population is zero and only the deviations from the mean increase/decrease the phenotype. In that scenario, a phenotype-increasing allele in one context can be a phenotype-decreasing allele in another and thus completely uncorrelated with the additive effects. Such models are statistically convenient but they are also less interpretable since no biological “mean centering” force exists in the population (at least under neutrality). They can also be fully recapitulated from combinations of sums and products of alleles in the model visualized above. So let’s stick with the simple epistasis model and see how it impacts other estimators.
Genome-wide epistasis mostly looks like additive heritability
We now move beyond the two locus model and ask how much genome-wide epistasis can be captured as additive (aka “narrow-sense”) heritability. To do so, we will simulate heritable traits with an increasing number of causal genetic variants that only operate through pairwise biological epistasis; i.e. every variant interacts with every other variant in its influence on the trait. Then we will estimate narrow-sense heritability in unrelated individuals and see how close we get to the true total heritability (you can think of this as estimating the additive variance explained by each polymorphism alone and then summing it up). Each interaction has a random effect size which can be positive or negative and there are upwards of thousands of interactions, so we might now expect most of the biological epistasis to be missed by the additive model.

Not so! Genome-wide epistasis at common variants is largely captured by narrow-sense heritability, just as in the two-locus analysis above. In fact, when causal variants are sampled uniformly from the full frequency spectrum (right-most point), the narrow-sense heritability recovers more than 80% of the biological epistasis. It is only when all interacting alleles are below 20% frequency that less than half of the epistatic heritability is recovered. We can also see that these findings are mostly insensitive to the number of causal variants, as long as it is reasonably large or there are no dominant/recessive effects.
bility estimates would (incorrectly) report the majority of the variance to be additive and produce accurate out-of-sample predictions from additive models. As Huang & Mackay point out, this is a fundamental problem with variance partitioning models in the presence of interactions: the model has to assign the interacting variance somewhere, and the most parsimonious place is not guaranteed to be the right place1.
In twins, epistasis makes the shared environment look like genes
With that in mind, let’s consider how epistatic effects influence narrow-sense heritability estimates from different quantitative genetic models. In the above simulations, we used only unrelated individuals — analogous to molecular/GWAS analyses with all causal variants measured — but what happens in classic twin and family-based studies? While additive genetic effects decay linearly with the genotypic correlations between two individuals — MZ twins share 100% of their additive effects, DZ twins share 50% of their additive effects and so on — pairwise epistatic effects decay quadratically, so MZ twins share 100% of their interactions but DZ twins share just 25% of their pairwise interactions (because they need to share both pairs). This implies that the component of pairwise epistasis that is not captured by additive variation will decay quadratically with decreasing genetic relatedness. To get a sense of this decay, we can visualize the relationship of the phenotypic/genotypic correlation for a trait with 20% biologically additive heritability and 40% biologically epistatic heritability, with the latter either largely tagged by the additive component (orange, as in the uniform causal variant simulation above) or largely untagged (red, as in the low frequency simulation above):

When epistasis is largely tagged by the additive genetic component, narrow-sense heritability estimated across the relatedness spectrum is close to the total (aka “broad-sense”) heritability of 60%: slightly lower when estimated from unrelated individuals towards the left (52%) and slightly higher when estimated from twins towards the right (64%). On the other hand, when epistasis is largely non-additive (red line), the estimate of heritability from unrelated individuals is very close to the additive component alone (24%), whereas the estimate from twins is actually substantially inflated over even the broad-sense heritability (78% when the truth is 60%). This inflation (which is well known in the twin literature) arises from the fact that the phenotype/genotype slope between MZ and DZ twins is steeper than the additive model expects, due to the fact that DZs now share many fewer interactions.
The upward bias in estimated narrow-sense heritability has to go somewhere, and in the ACE twin model it also leads to a downward bias in the estimated contribution of the shared environment (i.e. too much of the DZ correlation is assigned to genes)2. To get a feel for this, it helps to look at MZ/DZ correlations from real data. Recently, Eftedal et al. (2025), conducted a massive analysis of school performance and estimated MZ and DZ correlations of 0.86 and 0.49 respectively, corresponding to a conventional ACE additive heritability estimate of 74% and a shared environment estimate of 12%. Notably, these correlations were not consistent with other pedigree based heritability estimates in the same data and non-additivity was hypothesized as an explanation. Indeed, the same MZ/DZ correlations could also correspond to a trait with 20% additive heritability, 30% shared environment, and an epistatic heritability of 35% — a completely different picture!

In short, biological epistasis that is not captured by statistical additivity will inflate narrow-sense heritability estimates and deflate the shared environment estimates in twins. In the context of our unscaled model, such epistasis has to be driven largely by low frequency variants.
Epistasis below the phenotype
So far we have thought about epistasis at the level of (many) individual pairs of variants influencing the trait. We can instead imagine epistasis across individual heritable biological pathways that lead to the trait. In the case of two interacting pathways, all of their underlying causal variants become involved in pairwise interactions and the above variant-level implications continue to hold. What about more complicated relationships? Zuk et al. (2012) proposed a “limiting pathways” model, where multiple pathways driven by additive genetic variation are then constrained by some “rate-limiting” step before they manifest as the phenotype (for simplicity, Zuk et al. take the minimum of the pathways to define the phenotype). As an example, imagine that mood is influenced by multiple different brain pathways and manifests as clinical depression based on which pathway is the “lowest”. Even though each pathways is additive, this rate limiting step induces epistasis across them, which in turn leads to what the authors call “phantom heritability” — differences in the narrow-sense heritability estimate from twins and the true narrow-sense heritability in the population:

As the number of involved pathways grows, the phantom heritability grows as well. Even more striking, when there is a true shared environmental effect, the twin-based narrow-sense heritability estimate is actually inflated with the number of pathways (due to the upward bias described previously). For realistic MZ/DZ correlations, very large amounts of phantom heritability generally also imply a large (and underestimated) shared environment component, as demonstrated in simulations by Stringer et al. (2013). For instance, the MZ/DZ correlations in school performance we saw previously would be compatible with a k=4 limiting pathways model with a true narrow-sense heritability of 20-40% and a shared environment estimate of 40-60%.

Epistasis under selection
It is said that we exist in the context of all that came before us and it can therefore be useful to think about what happens to genetic epistasis over the course of human evolution. Now, the relationship between epistasis and fitness starts to matter: epistasis can be random/non-directional, positive (allele pairs amplify each other in the direction of fitness), or negative (the reverse). The broad consequences of these forms are summarized in simulations from Hansen (2006):

In a single generation the average response to selection (i.e. the expected change in phenotype in the next generation) only depends on the narrow-sense heritability, since only the additive component determines the trait mean in the next generation. For a fixed broad-sense heritability the response to selection will therefore be slower when there is more untagged epistasis and the narrow-sense heritability is lower3. When epistasis is random, the long-term response to selection is also very similar to that of an additive architecture, consistent with theory4. Over time, both the additive and epistatic variance shrinks (for example as alleles are fixed in the population) and the response to selection slows down. In short, random epistasis is largely irrelevant for thinking about selection on average5.
But the long-term response behaves very differently if epistasis is directional, that is, aligned with trait fitness in some way. Positive directional epistasis can greatly increase the selection response beyond what is initially anticipated. As allele frequencies shift around, the epistatic variance is converted into new additive variance, which in turn increases the response to selection and so on for many generations. Eventually a plateau is reached, additive variance again decays, and the response to selection slows down. In contrast, negative directional epistasis greatly constrains the long-term phenotypic response by causing additive variance to decay faster while the epistatic variance actually grows (the reverse of what happens with positive epistasis)6. These unexpected phenomena are sometimes referred to as the “evolution of evolvability”, in that a trait which initially appears to respond to selection slowly (have low additive variance / “evolvability”) may actually be able to reach much greater levels of response over time. The directional epistatic components are hiding actionable additive variation, which then becomes revealed over multiple generations of selection.
Epistasis makes interventions unpredictable
Where distinguishing epistasis from additivity really matters is in the context of individual-level interventions. Let us imagine we want to drive individuals to a level below the population mean on a certain trait, say cholesterol, by acting on the genetic mechanisms. For each individual, we can either nullify the variant they carry that has the largest positive additive effect in the population, or the pair of variants they carry that have the largest positive epistatic effect in the population. We do this iteratively for each individual until we get everyone below the target. For a trait with high statistically non-additive epistasis (i.e. a low-frequency architecture), here is what that looks like:

When we act on individual variants with the biggest main effect, the response is often unexpected (shown in red): delete a variant that is associated with increasing the trait, assume the trait will now go down, and yet the trait actually goes up even further. In a few cases, individuals fluctuate so much above zero that they run out of editable variants. On the other hand, acting on the epistatic mechanisms directly (right panel) can reliably and monotonically drive all of the individuals below the mean, as expected.
As a final sanity check, we can instead simulate a trait for which the epistasis is largely tagged by additive effects (by selecting causal variants from the full frequency distribution). For this largely (statistically) additive trait, both the additive interventions and the epistatic interventions generally lead to monotonic changes in the phenotype in the desired direction and the response to intervention is much more predictable:

Reconciling models of epistasis with the data
The estimates of heritability from twins, families, and adoptees do not agree. The differences between twin and family models have been hypothesized as evidence of a non-additive genetic component. However, large genetic analyses of common variants have effectively ruled out the contribution of dominance7 and widespread pairwise epistasis8. How can we explain these contradictions?
One possibility is that epistasis is substantial but — for reasons — exclusive to low-frequency variants and therefore neither tagged by additive genetic variance nor estimable from existing molecular studies. As a consequence, genetically driven interventions that do not account for epistasis will have highly unpredictable effects in individuals, including routinely moving their phenotype in the opposite direction to what is intended. If epistasis is directional, it can also greatly alter the long-term response to selection, likely reducing it9. A second possibility is that traits actually consist of multiple sub-phenotypes undergoing a non-linear bottleneck, as in the limiting pathways model, which then creates epistasis at the variant level. In this case, the genetic causes of a clinical condition in one family will often be profoundly different from the genetic causes of the same condition in another (“All happy families are alike; each unhappy family is unhappy in its own way”). Interventions that do not account for the individual-level limiting pathway will often fail, and treatment effect heterogeneity should be widespread. In both scenarios, narrow-sense heritability from twin studies is inflated, the influence of the shared environment must typically be large, and the impact of genes across generations decays much faster than expected. Of course, there is also the third possibility that twin model estimates are simply inflated by some other form of environmental confounding.
Fully resolving this question hinges on our understanding of rare variant architecture, for which there is currently insufficient statistical power to identify interactions directly (and, for non-founder populations, quantitative genetic models may never have enough power; see simulations/theory in Young et al. 2014). But we can get a kind of preview by looking for interactions between rare variants and common variant effects. Specifically, several studies have now tested for interactions between established rare pathogenic disease variants and common variant polygenic scores for the same disease, the latter aggregating all known additive effects into a single continuous value. Here’s what the interaction between rare and common variants looks like for three very different traits (fluid IQ, heart disease, and breast cancer):

Surprisingly, even these rare-common relationships look remarkably additive (and I’m not saying “surprisingly” here as a rhetorical device, these findings were genuinely surprising). In Kingdom et al. (2024), carriers of known developmental delay genes had about a 0.17SD lower fluid IQ measurement on average, yet carriers with an EA polygenic score above the 70th percentile were back at average fluid IQ, and the rare-common relationship was similarly additive for many related behavioral traits10. One might imagine that serious, genetically driven development delays would dampen the impact of whatever a more general polygenic score for education is capturing — yet remarkably that is not the case. Likewise, one might expect that disrupting key DNA damage repair genes like BRCA1/2 would predisposed one to cancer in a fundamentally, biologically different way to the rest of the population — and yet Fahed et al. (2020) show that the common polygenic score for breast cancer risk behaves identically in carriers and non-carriers and a very low risk score can nearly rescue the carriers11.
Yet again, the family-based models point at something that the molecular data does not show.
“The crux of the problem is the undesirable feature of the classical model as well as the alternative parameterizations that there is not a one-to-one correspondence between gene action at underlying quantitative trait loci and the partitioning of variance components except under very specific and restrictive circumstances.” ~ Huang & Mackay (2016)
Doing the math for pairwise epistasis, the estimated narrow-sense heritability from a twin ACE model is increased by +1.5x of the epistatic heritability and the estimated shared environment is decreased by -0.5x of the epistatic heritability. For higher order epistasis, the inflation approaches +2x and -1x respectively. Note that pairwise epistatic effects can alternatively be estimated with the twin ADE model, but this requires the additional assumption that the contribution of the shared environment is null (and no higher order epistasis).
“For a given initial total genetic variance, 𝑉0_𝐺, the initial response is slower with epistasis, because it is proportional only to the additive component, 𝑉0_𝐴. However, as allele frequencies fluctuate randomly, epistasis generates additional additive variance, such that the total selection response is the same.” ~ Paixão and Barton (2016)
“… pure non-directional epistasis has a response that is essentially identical to that of an additive architecture. The slight differences that are observed may be due to some random directional epistasis generated by random sampling of epistasis coefficients or by genetic drift. Thus, the theoretical prediction that the selection response is not affected by non-directional epistasis is valid for hundreds of generations until the selection limit is reached. This pattern appears robust to changes of parameters and initial conditions, except if the strength of epistasis is increased by orders of magnitude” ~ Carter et al. (2005)
“If epistatic interactions are random with respect to the marginal effects on the trait, and if the optimal genotype is the same under the epistatic and the corresponding additive models, then epistasis has no expected effect” ~ Barton (2016)
In principle, epistasis can also create a “rugged” fitness landscape where specific combinations of alleles greatly increase or decrease fitness and trap the selective process in local maxima. In practice, this is uncommon when drift is sufficiently large:
“Epistasis can sustain multiple ‘adaptive peaks' that can trap populations in suboptimal states. However, when selection on each allele is comparable to drift (that is, Nes≲1), random fluctuations allow populations to evolve more or less freely across rugged landscapes.” ~ Barton (2016)
“Of course, if epistasis is systematically positive, there will be an accelerating response, and a much larger total change than with the original additive effects; conversely, systematically negative epistasis leads to a smaller selection response” ~ Barton (2016)
Pazokitoroudi et al. (2021) and Palmer et al. (2023) both estimated genome-wide dominance heritability across 50 and 1,060 traits respectively in a large biobank and found a mean of ~zero with very high certainty.
Jabalameli et al. (2025) found no evidence for epistasis across ~1,000 common variants associated with height in a massive study of ~3.6 million 23andme participants. Fu et al. (2023) quantified “marginal epistasis” between each individual variant and all genome-wide variants in the UK Biobank and identified less than one significant example per tested trait. Hemani et al. (2014) initially identified a large number of significant epistatic interactions influencing gene expression, but these were later revealed to be largely a mix of imperfect tagging of causal variants and inflation in the test statistic and the paper was eventually retracted. Finally, there is the invisible graveyard of attempted projects in this area that have been presented at conferences and in poster sessions but never materialized.
We can speculate that selection will likely be reduced based on the fact that rare variants are more often deleterious and with the additional (perhaps strong) assumption that epistasis between two deleterious variants is also likely to be deleterious.
“Rare variant carrier status and EA-PGS appeared to have an additive effect when assessed against multiple related traits, with the effect of rare variants remaining similar throughout the EA-PGS spectrum.” ~ Kingdom et al. (2024)
“Within the limits of statistical power, the impact of the polygenic score appeared similar in carriers and noncarriers, odds ratio per standard deviation increment of 1.44 (1.19–1.74) and 1.57 (1.49–1.65), respectively, p-interaction = 0.94 (Wald Test and Methods).” ~ Fahed et al. (2020)


I don't think it should be surprising that most genetic effects combine roughly linearly. If the effects were highly nonlinear then genetic selection wouldn't work well (just ask anyone who's tried to maximize highly nonlinear functions of many variables).
Been waiting for this, great discussion!