Phenotype integration improves power and preserves specificity in ...

Abstract

Biobanks often contain several phenotypes relevant to diseases such as major depressive disorder (MDD), with partly distinct genetic architectures. Researchers face complex tradeoffs between shallow (large sample size, low specificity/sensitivity) and deep (small sample size, high specificity/sensitivity) phenotypes, and the optimal choices are often unclear. Here we propose to integrate these phenotypes to combine the benefits of each. We use phenotype imputation to integrate information across hundreds of MDD-relevant phenotypes, which significantly increases genome-wide association study (GWAS) power and polygenic risk score (PRS) prediction accuracy of the deepest available MDD phenotype in UK Biobank, LifetimeMDD. We demonstrate that imputation preserves specificity in its genetic architecture using a novel PRS-based pleiotropy metric. We further find that integration via summary statistics also enhances GWAS power and PRS predictions, but can introduce nonspecific genetic effects depending on input. Our work provides a simple and scalable approach to improve genetic studies in large biobanks by integrating shallow and deep phenotypes.

Main

Although major depressive disorder (MDD) is the most common psychiatric disorder and the leading cause of disability worldwide, its causes are largely unknown^1,2. Despite the moderate familial heritability of MDD (~40%)³, genome-wide association studies (GWASs) have only recently begun to identify replicable risk loci and polygenic risk scores (PRSs)^{4,5,6,7,1.2 million individuals highlight new therapeutic directions. Nat. Neurosci. 24, 954–963 (2021)." href="https://www.nature.com/articles/s41588-023-01559-9#ref-CR8" id="ref-link-section-d88530019e836_4">8,9}. These discoveries were enabled by increasing power along two primary dimensions: depth of phenotyping and sample size¹. Increasing sample size improves both GWAS and PRS power by reducing the standard errors of estimated genetic effects on a given MDD phenotype^4,10. Alternatively, increasing diagnostic accuracy through structured clinical interviews prevents dilution of genetic effect sizes, thus improving GWAS power^1,9,11 and PRS accuracy^11,12. In practice, studies have a fixed budget and must trade off between increasing sample size or phenotyping depth. The optimal choice for current and future MDD studies remains contested^11,13,14. Ultimately, the choice will depend on the study's goals.

One important goal is statistical explanation, defined as the number of GWAS hits or the PRS prediction accuracy. Most MDD GWASs have focused on this goal, which is best achieved by maximizing sample size^11,12. This motivates the use of shallow phenotypes in large biobanks, including self-reported depression and treatment^5,7. Sample sizes are often further increased by including health record information of seeking care for depression (for example, the Integrative Psychiatric research consortium (iPSYCH¹⁵) and the Million Veterans Program^{1.2 million individuals highlight new therapeutic directions. Nat. Neurosci. 24, 954–963 (2021)." href="https://www.nature.com/articles/s41588-023-01559-9#ref-CR8" id="ref-link-section-d88530019e903">8}). These studies have amassed sample sizes of millions of individuals and have identified hundreds of risk loci, as well as PRSs with state-of-the-art prediction accuracy in European-ancestry clinical cohorts^{4,5,6,7,1.2 million individuals highlight new therapeutic directions. Nat. Neurosci. 24, 954–963 (2021)." href="https://www.nature.com/articles/s41588-023-01559-9#ref-CR8" id="ref-link-section-d88530019e910">8}.

A partly distinct goal is biological insight. This is more difficult to measure or even define, but it represents one of the ultimate goals of genetics: characterizing biological mechanisms to improve prediction and treatment for all. This goal may never be achieved by increasing sample size with shallow phenotyping, because shallow phenotypes are confounded by genetic effects that do not pertain to MDD biology¹¹. In contrast, deep phenotyping in clinical cohorts (for example, the Psychiatric Genomics Consortium (PGC)⁴ and the China, Oxford and VCU Experimental Research on Genetic Epidemiology (CONVERGE)⁹) has identified a handful of replicated genetic loci that could potentially generate hypotheses on MDD-specific biology. However, this has not yet been demonstrated robustly, as current sample sizes simply do not provide the power to yield enough genetic signals for definitive biological inferences⁹.

In this paper, we propose to bridge the shallow–deep gap by integrating information across hundreds of MDD-relevant phenotypes in UK Biobank^11,16 (Fig. 1). We focus on using phenotype imputation^17,18 to increase the effective sample size for the deepest MDD phenotype in UK Biobank (LifetimeMDD)¹¹, which dramatically improves GWAS power and PRS accuracy over any individual MDD phenotype¹⁹. We extensively characterize the genetic architecture underlying these imputed phenotypes and show that they remain specific to LifetimeMDD. Further, we develop a novel approach to partly remove nonspecific signals from GWASs on shallow phenotypes akin to latent factor corrections in expression quantitative trait locus (eQTL) studies^20,21,22,23. We also investigate phenotype integration via GWAS summary statistics using multi-trait analysis of GWAS (MTAG), which offers varying specificity and sensitivity depending on input choices. Finally, we developed a novel metric to quantify the specificity of a given PRS, which demonstrates that imputed deep phenotypes of MDD are both more specific and more sensitive than observed shallow phenotypes.

Results

Phenotype imputation increases effective sample size

We focused on the deepest available measure of MDD in UK Biobank¹¹, LifetimeMDD, which we derived by applying clinical diagnostic criteria in silico to MDD symptom data from the Patient Health Questionnaire 9 (PHQ9) and the Composite International Diagnostic Interview Short Form (CIDI-SF) in the online Mental Health Questionnaire (MHQ). This procedure identified 16,297 LifetimeMDD cases and 50,867 controls. Because most individuals did not complete these questionnaires, LifetimeMDD was missing for 269,962 individuals. We also studied a shallow measure of MDD, GPpsy¹¹, defined as seeking help from a general practitioner (GP) for "depression, anxiety, tension, or nerves". For imputation and downstream analyses, we used a broad depression-relevant phenome with 217 phenotypes, including comorbidities and family history, as well as socioeconomic, demographic and environmental phenotypes (Supplementary Note and Supplementary Table 1).

We first imputed the depression phenome using SoftImpute²⁴ (Methods), a variant of principal component analysis (PCA) that identifies latent factors from observed data and uses them to impute missing data. We previously found SoftImpute to be the most scalable among several established approaches^17,24. We tuned SoftImpute's regularization parameter using realistically held-out test data by taking unions of missingness patterns across samples¹⁷ and also used this approach to estimate the imputation accuracy for each phenotype (Extended Data Fig. 1)¹⁷. Imputation accuracy varied widely across phenotypes, ranging from R² = 1% for being a twin to R² = 97% for neuroticism score. For LifetimeMDD (80% missing), we estimated the phenotype imputation R² to be 40%. This roughly translates to doubling its effective sample size^17,25 (n_observed = 67,000, n_effective = 166,000; Methods and Extended Data Fig. 1). Imputation accuracy was comparable when stratifying by sex, which is a significant MDD risk factor^26,27,28 (Supplementary Fig. 1 and Supplementary Table 2). We found that the imputed measures had deflated variances and inflated correlations (Supplementary Note and Supplementary Fig. 2), as expected¹⁷. This effect could bias some downstream tests, such as tests for genetic correlation. One main goal of this work was to determine whether this approach to phenotype imputation is biased for large-scale single-trait genetic studies.

Finally, we applied a new deep-learning imputation method, AutoComplete²⁹, to the same phenotype matrix (Methods). AutoComplete improved estimated imputation accuracy for most phenotypes with &gt;10% missingness (29 of 42) and increased the average estimated by R² by 2.9%.

Phenotype imputation improves GWAS power

We performed GWASs on observed LifetimeMDD (n = 67,164), imputed values of LifetimeMDD (ImpOnly, n = 269,962) and the concatenation of imputed and observed LifetimeMDD (ImpAll, n = 337,126; Fig. 2 and Methods). GWAS on the observed values of LifetimeMDD identified one significant locus (Fig. 2e). In GWASs on the imputed values, the number of GWAS loci increased to 13 and 18 for SoftImpute and AutoComplete, respectively (Fig. 2a, b and Supplementary Table 3). Finally, in GWASs on the combination of both imputed and observed values, the number of significant loci further increased to 26 and 40 for SoftImpute and AutoComplete, respectively (Fig. 2c, d and Supplementary Table 3). We confirmed that these improvements in the number of GWAS hits over the single hit from observed LifetimeMDD were unlikely to occur by chance (Supplementary Fig. 3).

**Fig. 2: Genetic architecture of observed and imputed LifetimeMDD.**

We investigated whether the new GWAS hits from phenotype imputation were MDD specific by comparing the ImpOnly GWASs to other MDD GWASs. First, we compared the two imputation methods. Of the 13 and 18 ImpOnly GWAS loci for SoftImpute and AutoComplete, respectively, 8 overlapped (giving a total of 23; Extended Data Fig. 2). Further, 9 of the remaining 15 loci had P &lt; 10⁻⁵ in both ImpOnly GWASs, and all 15 of the 23 loci had P &lt; 0.05/23. Overall, our two imputation methods captured highly overlapping genetic signals, but AutoComplete identified more loci. Next, we assessed the eight shared hits in four nonoverlapping depression cohorts (Methods and Supplementary Note): observed LifetimeMDD in UK Biobank, self-reported depression diagnosis or treatment in 23andMe⁷, the 29 MDD cohorts of the PGC⁴ (PGC29) and Danish registry data on MDD cases and population controls (iPSYCH^15,30). For reference, we also compared our findings to those for neuroticism in UK Biobank, a personality trait that is genetically correlated with, but distinct from, MDD³¹. We found that all eight hits shared by both ImpOnly GWASs had sign-consistent effect sizes across all four depression cohorts, as well as in neuroticism. Moreover, all eight hits had P &lt; 0.05/23 for observed LifetimeMDD in UK Biobank. Finally, of the 23 single-nucleotide polymorphisms (SNPs) significant in either of the ImpOnly GWASs, 18 replicated in at least one GWAS of observed MDD at P &lt; 0.05/23 (Fig. 2g). Altogether, these results show that the loci underlying imputed LifetimeMDD are relevant to MDD.

We then checked whether the ImpOnly GWAS preserved the polygenic architecture of LifetimeMDD in terms of heritability and genetic correlation. First, we found that the liability-scale SNP-based heritability (\({h}_{{\rm{g}}({{\rm{liab}}})}^{2}\) from LDSC³²) was lower for imputed LifetimeMDD (Soft-ImpOnly \({h}_{{\rm{g}}({{\rm{liab}}})}^{2}\) = 13.1%, standard error (SE) = 1.0%; Auto-ImpOnly \({h}_{{\rm{g}}({{\rm{liab}}})}^{2}\) = 14.0%, SE = 1.1%) than for observed LifetimeMDD (\({h}_{{\rm{g}}({{\rm{liab}}})}^{2}\) = 19.0%, SE = 2.9%; Fig. 2f). Nonetheless, the genetic correlations between imputed and observed LifetimeMDD were near 1 (Soft-ImpOnly: r_g = 0.97, SE = 0.02; Auto-ImpOnly: r_g = 0.96, SE = 0.03), as it was between the two imputation methods (r_g = 1.00, SE = 0.004). Moreover, the r_g between the ImpOnly phenotypes and other depression phenotypes largely mirrored the r_g based on observed LifetimeMDD (Fig. 2h).

Finally, we tested for effect size heterogeneity between the ImpOnly and observed LifetimeMDD GWASs. We used a random-effect meta-analysis³³ (Methods), as the ImpOnly and observed LifetimeMDD GWASs used nonoverlapping individuals. We found no significant heterogeneity in effect size between ImpOnly and observed LifetimeMDD at genome-wide significance (Extended Data Fig. 2), and across the 13 and 18 GWAS hits in Soft-ImpOnly and Auto-ImpOnly, respectively, 6 and 4 SNPs showed significant heterogeneity at P &lt; 0.05/23. Altogether, our results show that imputed LifetimeMDD is noisier than observed LifetimeMDD but captures similar genetic effects.

Phenome-wide factors index pleiotropic axes of depression risk

We examined the top latent factors in SoftImpute to understand what phenotypic correlations drive the imputation. We used two statistical metrics to prioritize significant factors. First, we quantified the phenome-wide variance explained (Methods and Fig. 3a): the top handful of factors clearly stood out, with factors becoming comparable to background noise levels around factor 30. Second, we quantified factor stability by calculating the R² between factors estimated on separate halves of the data, similar to prediction strength in clustering³⁴ (Methods and Fig. 3b). We found that the first ten factors were very stable (min R² of ~ 98%), with stability decaying steadily thereafter (factors 11–20, average R² of ~ 80%; factors 21–30, average R² of ~60%). We conservatively conclude that the first ten or so factors are statistically meaningful.

**Fig. 3: Characterizing top latent factors driving SoftImpute.**

We studied the genetic basis of each factor with GWASs. The number of GWAS hits ranged from 3 (factor 2, mainly indexing age-related phenotypes; Fig. 3c) to 309 (factor 3, mainly indexing socioeconomic status and education attainment), with λ_GC ranging from 1.15 (factor 2) to 2.11 (factor 3) (Supplementary Fig. 5). We next estimated heritabilities and found they ranged from \({h}_{{\rm{g}}}^{2}\) = 1.9% (SE = 0.2%) for factor 2 to \({h}_{{\rm{g}}}^{2}\) = 22.4% (SE = 0.9%) for factor 3 (Fig. 3d and Supplementary Fig. 4). Finally, we profiled the genetic correlation between factors and MDD-related phenotypes (Fig. 3e and Supplementary Fig. 4). We found that the genetic correlations closely mirrored the factor loadings, which are based on phenotypic correlations. For example, factor 1 had r_g = −0.93 (SE = 0.01) with neuroticism and factor 3 was correlated with years of education (r_g = 0.79 (SE = 0.96) and income (r_g = 0.75, SE = 0.03).

Given these results, we hypothesized that our top phenome-wide factors partly capture nonspecific pathways contributing to shallow MDD phenotypes. To test this hypothesis, we performed GWAS analysis on a shallow MDD measure, GPpsy (n = 332,629), conditional on factor 1. This is akin to removing confounders such as batch effects in eQTL studies through conditioning on latent factors. We found that only 1 of the 25 GWAS hits for GPpsy remained after adjusting for factor 1 (Fig. 3f). This hit overlaps NEGR1 (top SNP rs1194283, odds ratio (OR) = 1.05, SE = 0.0065, P = 6.71 × 10⁻¹³; Fig. 3g), which has been identified as an MDD risk locus in multiple GWASs with varying phenotyping approaches^{4,6,1.2 million individuals highlight new therapeutic directions. Nat. Neurosci. 24, 954–963 (2021)." href="https://www.nature.com/articles/s41588-023-01559-9#ref-CR8" id="ref-link-section-d88530019e1827">8,35,36,37}. Intriguingly, this locus also has replicated associations with body mass index and obesity in multiple populations^38,39,40,41. We also performed GWASs conditioned on each of the other top ten factors, which generally had little impact (Supplementary Fig. 6). One clear exception, however, was that adjusting for factor 3 increased the number of GPpsy GWAS hits from 25 to 35. These additional loci could be false positives from collider bias^23,42,43.

MTAG improves GWAS power but is sensitive to inputs

As an alternative to phenotype imputation, we next evaluated phenotype integration at the GWAS summary statistic level using MTAG¹⁹, an inverse-covariance-weighted meta-analysis for GWAS on multiple traits. We did not use all 217 phenotypes in MTAG for two reasons. First, MTAG requires GWAS to be run on each input phenotype, which is computationally intractable for hundreds of phenotypes. Second, MTAG produces false positives when applied to large numbers of input GWAS¹⁹. Instead, we evaluated six different sets of input GWAS to MTAG, producing six different integrated LifetimeMDD GWASs (Fig. 4a, Extended Data Fig. 3, Supplementary Table 4 and Supplementary Note).

**Fig. 4: MTAG results for different choices of input phenotypes.**

All six MTAG input choices increased the number of GWAS hits from observed LifetimeMDD. On the low end, MTAG using family history measures of depression yielded five GWAS hits (MTAG.FamilyHistory, λ_GC = 1.20; Fig. 4b). On the high end, MTAG using shallow MDD phenotypes and environmental factors (such as recent stressful life events, lifetime traumatic experiences and the Townsend deprivation index) yielded 33 GWAS hits (MTAG.All, λ_GC = 1.45; Fig. 4c). We note that MTAG.AllDep is analogous to prior depression phenotypes defined by manually combining similar depression measures¹⁴ and that MTAG.FamilyHistory is analogous to prior methods that integrated family history into GWAS analysis^44,45. Of the 51 total hits across all MTAG runs, 34 overlapped hits from the imputed GWASs (Extended Data Fig. 3). Notably, we found that adding more input phenotypes in MTAG always increased the number of GWAS hits. This reflects a combination of added power by leveraging pleiotropy and an increased false positive rate¹⁹. Additionally, MTAG GWASs yielded substantially inflated heritability estimates (on both the liability and observed scales), which increased with more input GWAS. For example, MTAG.All gave \({h}_{{\rm{g}}({{\rm{liab}}})}^{2}\) = 84.9% (SE = 5.6%), compared to \({h}_{g({liab})}^{2}\) = 19.0% (SE = 2.9%) for observed LifetimeMDD (Fig. 4d and Supplementary Table 5).

We next examined genetic correlations between MTAG and other MDD GWASs (Fig. 4e). First, MTAG.All, which included the most input GWAS, clustered together with the imputed GWASs, which leveraged all 217 phenotypes. Second, MTAG using shallow MDD phenotypes (MTAG.AllDep and MTAG.GPpsy) clustered with the GWAS on GPpsy. Third, neuroticism was significantly more genetically correlated with MTAG.Envs (r_g = 0.84, SE = 0.01) than LifetimeMDD (\({r}_{g}\) = 0.66, SE = 0.06). These results are consistent with prior observations that MTAG-based summary statistics modestly inflate genetic correlation with the input phenotypes⁴⁶. Overall, the genetic correlations between MTAG and LifetimeMDD were high, with the lowest value derived for MTAG.Envs (r_g = 0.90, SE = 0.03).

Finally, to compare like to like, we evaluated SoftImpute's accuracy using only the MTAG.All input phenotypes (plus sex, age and 20 PCs). We found that imputation performed much worse with this reduced set of phenotypes (phenome-wide R² decreased from 59.6% to 39.5%, P &lt; 2 × 10⁻⁵, pooled t-test across folds; Supplementary Fig. 2 and Supplementary Table 2).

Phenotype imputation and MTAG improves PRS accuracy

We then assessed the within-sample prediction accuracy of PRSs built from integrated MDD GWAS. We used ten-fold cross-validation to obtain the Nagelkerke's \({R}^{2}\) prediction accuracy for LifetimeMDD in white British individuals in UK Biobank. We jointly cross-validated the phenotype imputation and PRS construction (Methods). For MTAG, we jointly cross-validated the GWASs on secondary input phenotypes. For context, we compared these PRSs to ones built from observed LifetimeMDD (n = 67,164; for n_effect see Supplementary Table 6) and GPpsy (n = 332,629) in UK Biobank¹¹, MDD defined by structured interviews in PGC29 (ref. ⁴) (n = 42,455), affective disorder defined by Danish health registries in iPSYCH¹⁵ (n = 38,123) and self-reported depression in 23andMe⁷ (n = 307,354; Supplementary Table 6).

We found that imputing LifetimeMDD doubled PRS prediction accuracy over observed LifetimeMDD (Fig. 5a; LifetimeMDD: R² = 1.0%, 95% confidence interval (CI) = 0.6–1.4%; Soft-ImpAll: R² = 2.1%, 95% CI = 1.3–2.9%; Auto-ImpAll: R² = 2.2%, 95% CI = 1.4–3.0%). Consistent with prior reports^11,12, we found that the GPpsy PRS predicted LifetimeMDD better than the LifetimeMDD PRS itself (R² = 1.6%, 95% CI = 0.6%–2.4%) because it has a roughly four times larger sample size. Nonetheless, both the SoftImpute and AutoComplete PRSs outperformed the GPpsy PRS, demonstrating that integrating shallow and deep phenotypes through imputation can improve PRS accuracy over that with either class of phenotypes alone. Finally, we found that the imputed LifetimeMDD PRS substantially outperformed the PRSs from iPSYCH (R² = 0.6%, 95% CI = 0.2–0.9%) and 23andMe (R² = 1.3%, 95% CI = 0.7–1.9%), even though iPSYCH used deeper phenotypes and 23andMe had a large sample size.

**Fig. 5: PRS performance using observed, imputed and/or meta-analyzed MDD.**

The performance of the MTAG PRSs mirrored the MTAG GWAS results and depended on the input phenotypes (Fig. 4 and Extended Data Fig. 3). First, MTAG.FamilyHistory showed improved accuracy compared to observed LifetimeMDD but underperformed in comparison to the imputed PRSs (R² = 1.5%, 95% CI = 0.6–2.5%; Fig. 5a). On the other hand, MTAG.All outperformed the imputed PRSs by about 20% (R² = 2.6%, 95% CI = [1.3%, 3.9%]; Fig. 5a). In particular, this demonstrates that MTAG with more than ten inputs, which is nonstandard and likely yields miscalibrated GWAS results, can significantly improve PRS prediction.

Phenotype imputation and MTAG improves PRS portability

We next asked whether phenotype integration improves PRS predictions in different cohorts, diagnostic systems and/or populations. Demonstrating portability is essential to establish that phenotype integration does not merely reflect dataset-specific biases.

First, we tested PRS accuracy in non-British European-ancestry individuals in UK Biobank (UKB.EUR, n = 10,166). These individuals were measured on the same LifetimeMDD phenotype as our sample of white British UK Biobank individuals and also have European ancestry, and hence represent the most similar cohort (Supplementary Note and Supplementary Fig. 7). Although the small sample size limits definitive conclusions, we observed a nearly identical pattern among the PRS methods as in our training sample: imputation and MTAG almost always improved prediction accuracy compared to both LifetimeMDD and GPpsy (Fig. 5b). We next assessed portability to two large European-ancestry cohorts from iPSYCH (2012 cohort (n = 42,250) and 2015i cohort (n = 23,351); Supplementary Note). These nonoverlapping samples were drawn from a nationwide Danish birth cohort with diagnoses obtained from national health registers^30,47. We again found qualitatively identical results, with imputation outperforming both LifetimeMDD and GPpsy and the best MTAG setting outperforming imputation (Fig. 5b). Finally, we tested portability to European-ancestry individuals in the ATLAS dataset based on MDD as defined in University of California - Los Angeles electronic health record (EHR) data^48,49 (ATLAS.EUR, n = 14,388; Supplementary Note, Supplementary Fig. 8 and Supplementary Tables 6–8). Again, the small sample size prevented definitive comparisons, but phenotype imputation and the best MTAG setting improved estimated accuracy (Fig. 5b).

We next tested these PRSs in individuals with non-European genetic ancestries, including African-ancestry individuals⁷ in UK Biobank with observed LifetimeMDD status (UKB.AFR, n = 687), as well as Han Chinese-ancestry individuals in the CONVERGE cohort^9,50 (n = 10,502; Supplementary Note) who were assessed for severe, recurrent MDD (Fig. 5c, Supplementary Note and Supplementary Table 6). Consistent with previous studies^51,52,53, we found that the PRSs we derived from GWASs on European-ancestry cohorts generally had poorer portability to non-European-ancestry cohorts, making firm conclusions difficult. Nonetheless, the best MTAG setting was always nearly optimal, and PRSs based on imputed LifetimeMDD always outperformed the PRS based on observed LifetimeMDD. Finally, we tested PRS prediction accuracy in UK Biobank individuals with Asian ancestry (UKB.ASN, n = 334), as well as ATLAS individuals who self-identified as Latino (ATLAS.LAT, n = 2,454), Black (ATLAS.AFR, n = 1,158) or Asian (ATLAS.ASN, n = 1,996). However, the power was too low in these small cohorts for meaningful interpretation (Supplementary Fig. 9).

A new metric for the specificity of PRSs

While phenotype integration improves PRS prediction in UK Biobank and in external cohorts, this may come at the cost of reduced specificity to MDD. This is because integration explicitly borrows information from secondary phenotypes, which could introduce genetic signals that are not specific to MDD. To quantify this spillover of nonspecific effects into an MDD PRS, we compared prediction accuracy for LifetimeMDD to prediction accuracy for secondary phenotypes. We call this metric of specificity 'PRS pleiotropy' (R²_secondary/R²_LifetimeMDD). Because core MDD biology is likely partly pleiotropic, its PRS pleiotropy should be nonzero for many secondary phenotypes. We further expect that shallow MDD phenotypes, such as GPpsy, would generally have higher PRS pleiotropy than LifetimeMDD¹¹.

For all PRSs based on observed and integrated MDD, we calculated PRS pleiotropy for 172 secondary phenotypes used in imputation. We then investigated the 62 secondary phenotypes that were significantly predicted by any examined PRS (P &lt; 0.05/172; Methods). Visualizing PRS pleiotropy for observed LifetimeMDD across this depression phenome showed a spectrum of highly linked traits, including shallow MDD phenotypes such as GPpsy and genetically correlated traits such as neuroticism, that quickly faded across successive phenotypes (Fig. 6a). By comparison, GPpsy broadly had higher PRS pleiotropy across secondary phenotypes, indicating that GPpsy captures less-specific biology than LifetimeMDD, as expected. We also found that the 23andMe GWAS had similar PRS pleiotropy to GPpsy, consistent with the fact that both measure MDD by self-reported depression.

**Fig. 6: Phenome-wide PRS pleiotropy quantifies nonspecificity.**

Our main question here was whether phenotype integration has more PRS pleiotropy than self-reported depression as in GPpsy and 23andMe. We first evaluated PRS pleiotropy for MTAG and found that specificity highly depended on the input GWAS (Fig. 6b). First, MTAG.Envs had far higher PRS pleiotropy than GPpsy, showing that its high PRS power comes at a high cost to specificity. On the other hand, MTAG.All had a similar PRS pleiotropy as GPpsy and nearly double the PRS R²; hence, MTAG.All is clearly superior to GPpsy. MTAG.FamilyHistory had the opposite properties: it only modestly improved PRS R² over observed LifetimeMDD, but this benefit came without loss of specificity. We then evaluated PRS pleiotropy for imputed phenotypes (Fig. 6c). The SoftImpute ImpAll and ImpOnly PRSs were both more specific to LifetimeMDD than the GPpsy PRS, which is notable given that the imputed values were constructed from more than 200 phenotypes, including GPpsy. The AutoComplete ImpOnly PRS was less specific than GPpsy, although the ImpAll PRS was comparable.

Finally, we evaluated PRS pleiotropy relative to observed LifetimeMDD to ask which of the phenotypes had excess pleiotropy in the integrated PRSs. We defined 'excess PRS pleiotropy' as the PRS pleiotropy minus the PRS pleiotropy of observed LifetimeMDD¹¹. As expected, self-reported depression PRSs (GPpsy and 23andMe) had high excess PRS pleiotropy, especially when compared to shallow MDD measures (Extended Data Fig. 4a). Likewise, MTAG.Envs had substantial excess PRS pleiotropy, especially for socioeconomic measures such as years of education (Extended Data Fig. 4b). Notably, MTAG.FamilyHistory had far less excess PRS pleiotropy than other MTAG settings or GPpsy, as well as reduced pleiotropy for several socioeconomic measures. Finally, SoftImp-All had lower excess PRS pleiotropy than GPpsy (41/62 phenotypes); however, AutoImp-All had higher excess PRS pleiotropy (Extended Data Fig. 4c). Overall, MTAG can outperform imputation in PRS sensitivity or specificity depending on input, while imputation provides a simple and scalable approach that performs well in both power and specificity.

We then downsampled each GWAS used to build a PRS to assess the impact of sample size (Supplementary Note). Overall, we found that PRS pleiotropy was stable, although it can be upwardly biased for sample sizes below 100,000 (due to low power). In particular, our results were robust to differences in the training PRS sample sizes, except that observed LifetimeMDD PRS pleiotropy is a conservative baseline because it is trained on 67,000 individuals (Extended Data Fig. 5 and Supplementary Figs. 10 and 11). Further, we confirmed that these results persisted when we used the same SNPs in each PRS (Supplementary Note and Extended Data Fig. 6).

Discussion

In this paper, we address the power–specificity tradeoff between deep and shallow MDD phenotypes by integrating them using phenotype imputation or MTAG. We show that the integrated MDD phenotypes greatly improve GWAS power and PRS accuracy while, crucially, preserving the genetic architecture of MDD. We propose a novel metric to assess the disorder specificity of a PRS that is widely applicable to biobank-based GWASs. This metric characterizes a power–specificity tradeoff. For MTAG, adding more phenotypes increases power but generally sacrifices specificity. Imputing LifetimeMDD with SoftImpute, on the other hand, better preserves specificity. Overall, both approaches to phenotype integration outperform either deep or shallow phenotypes alone.

Phenotype imputation is a simple and scalable approach that should be considered for most biobank-based genetic studies. An important but challenging future application will be longitudinal data, which are often sporadically measured across time and individuals with potentially extreme nonrandom missingness. One limitation of our specific imputation approaches is that they distort higher-order moments (Supplementary Fig. 1), which will bias some downstream analyses such as genetic correlation. In future work, this could be addressed with multiple imputation⁵⁴ or with heteroskedasticity-aware downstream tests⁵⁵.

MTAG has several important strengths that complement imputation. First, it operates on summary statistics, which enables incorporating external GWASs and is computationally cheap once GWASs have been performed. Second, we found that MTAG.All generally outperforms imputation in GWAS hits and PRS power and portability. However, MTAG generally has less specificity. A striking exception is MTAG.FamilyHistory, consistent with prior methods that specifically leverage family history to improve genetic studies^44,45,56. However, MTAG is highly sensitive to input phenotype selection and therefore requires extensive domain knowledge, similar to other work combining multiple depression measures¹⁴. There are several extensions to further improve MTAG for phenotype integration. First, we could incorporate local estimates of genetic correlation in MTAG, which has improved LifetimeMDD PRSs in UK Biobank⁵⁷. Second, we could directly combine PRSs for multiple traits using weights to optimize prediction^58,59,60. Third, we could develop a more systematic approach to choosing MTAG inputs, but this search is limited by the computational cost of performing cross-validated GWASs on each considered trait. Fourth, parametric models of confounding in summary statistics, such as GWAS-by-subtraction⁶¹, could improve specificity. However, these models rely on choosing appropriate inputs and causal models, which is not straightforward for heterogeneous disorders. Finally, we could input the GWASs on imputed phenotypes to MTAG; however, this may exacerbate their biases because imputation inflates the correlations between traits that MTAG leverages (Supplementary Fig. 1).

Our study has implications for improving disorder specificity in future MDD studies. We have worked on the deepest MDD phenotype in UK Biobank, LifetimeMDD, which is derived by applying DSM-5 criteria in silico to self-rated MDD symptoms in the MHQ. Although it lies on essentially the same genetic liability continuum as gold-standard MDD¹¹, LifetimeMDD is shallow compared to a clinical diagnosis based on a structured in-person interview due to self-report biases and misdiagnoses^62,63,64,65. In the future, this bias could be mitigated using probability weights, which have recently been used for GWASs^66,67. More broadly, the DSM-5 criteria for MDD have substantial shortcomings in reliability^68,69,70. Nonetheless, improving the MDD diagnostic criteria may be only achievable ...

Search This Blog

WordWD