A robust estimation of mosaic loss of chromosome Y from genotype-array-intensity data to improve disease risk associations and transcriptional effects

Accurate protocols and methods to robustly detect the mosaic loss of chromosome Y (mLOY) are needed given its reported role in cancer, several age-related disorders and overall male mortality. Intensity SNP-array data have been used to infer mLOY status but there are discrepancies of reported findings due to the calling uncertainty of the methods and the differences in the tissue-matrix used. We proposed MADloy, a method that optimizes mLOY calling by correctly modeling the underlying reference population with no-mLOY status and incorporating B-deviation information. MADloy increased the calling accuracy with respect to previous methods in experimentally validated samples and improved the statistical power to detect associations with disease and mortality in simulation studies and real dataset analyses. Using MADloy in 18 individuals, we observed that mLOY cellularity slightly increased in blood after 3 years but was detectable in only 41% samples (7/17) in saliva. We found extreme down-regulation of the entire chromosome Y in kidney and bladder tumors with mLOY, while mLOY in blood was associated with deregulation of DNA replication, repair and recombination pathways. MADloy will help to define the mechanisms and roles of mLOY in different tissues as a male-specific marker of disease for personalized medicine.


Introduction
The most common somatic mutation in humans is the mosaic loss of chromosome Y (mLOY) in men. Recent evidence suggests the important role of mLOY in numerous diseases, being a biological factor that contributes to overall male mortality (1,2) and, therefore, is likely to play an important role in male-specific treatments of disease. In particular, mLOY in blood cells increases with age, and is associated with smoking and with the risk of several age-related disorders, including hematological and non-hematological cancers, macular degeneration (3), Alzheimer's disease (4), major cardiovascular events (5) and suicidal behaviors (6). Despite the accumulating evidence, the mechanisms that trigger mLOY and its clinical consequences are still poorly understood. While susceptibility loci and epigenetic marks for mLOY have been identified (7,8), larger and more accurate association and functional studies are required; in particular, as some inconsistencies have been observed (9,10). Therefore, current mLOY calling methods and protocols need to be improved to confirm disease risks and mechanisms.
Here, we addressed two main issues that have been proposed to affect mLOY reproducibility, the optimal processing of mLOY calling from SNP array intensity data and the accuracy of detecting mLOY on different tissue-matrices.
A main source of discrepancy can be due to the differences in the methods used to call mLOY status from SNP array intensity data (9,11). The main signal used for mLOY calling is the log-R-Ratio (LRR) of the SNPs in chromosome Y. As a relative measure of the DNA content of a subject at a genomic locus with respect to a group of individuals, men with mLOY are expected to show low SNP-LRR values across the 56-Mb male-specific region of chromosome Y, which excludes the homologous region between chromosomes X-Y, pseudoautosomal (PAR1 and PAR2) and X-Y transposed (XTR), (12). Computing the mean LLR (mLRR-Y) in the region for each individual, the mLRR-Y thresh method called mLOY status on those individuals with mLRR-Y lower than the 99% confidence interval of experimentally induced mLRR-Y variation of normal individuals (13). As a threshold dependent method, mLRR-Y thresh is sensitive on the characterization of subjects with no-mLOY. Assuming that gains of chromosome Y (GOY) are rare, the positive side of the mLRR-Y distribution (centered at the peak) can be used to identify mLOY outliers by reflecting it to the negative mLRR-Y values to define the calling threshold.
However, while GOY events are less frequent than LOY events, GOY events can also be relevant particularly in tumor tissues. Furthermore, the presence of few of them is enough to affect the mLOY threshold; in addition, plausible deviations from symmetry in the mLRR-Y distribution also affect the position of the threshold. These two uncontrolled effects can introduce misclassification in mLOY calling reducing the power to find positive associations (14). Alleviation for mLOY misclassification in the association analyses can be considered if mLRR-Y is taken as a quantitative continuous variable (mLRR-Y quant ) (11). However, the approach is still suboptimal because the underlying distribution is a clear mixture of individuals with gains, losses or no changes in chromosome Y genetic content, a similar situation found in copy number variation calling (15,16). Therefore, improvements on the use of mLRR-Y to call mLOY status should include the robust identification of mLOY events as outliers from an mLRR-Y non-symmetric distribution. In addition, current methods have not used the B-allele frequency (BAF) signal in the homologous region between chromosomes X-Y as an independent signal to confirm findings. The deviation of BAF signal in heterozygous probes from its expected value, namely B-deviation, has been robustly used to detect different types of chromosomal events in mosaicism and compute their cellularity levels (Rodriguez-Santiago, 2010). We have therefore developed MADloy, a method that integrates robust outlier identification of mLRR-Y values with the use of B-deviation signal to improve mLOY calling. We show the improved performance of the method against current approaches using simulations and real data sets. MADloy has been implemented as a Bioconductor package for robust mLOY calling along with visualization functions.
Differences of mLOY detection between blood and buccal smear have also been proposed as a possible source of discrepancy between two large studies that estimated the mortality ratios associated to mLOY (9,11). To address this issue, we studied 18 individuals with positive mLOY status detected with MADloy in blood at base-line and assessed the progression of mLOY cellularity in both blood and saliva in a follow-up visit at 3 years, to determine the extent to which the tissue matrices are comparable for mLOY detection. We finally illustrate the application of MADloy in the study of the transcriptional correlates of mLOY in blood and in cancer.

mLOY calling
The reference methods for calling mLOY status using genotype-array intensity are those described in (1) and (8). Forsberg et al. (13) proposed to analyze the log R ratio (LRR) values of SNPs probes in the male-specific region of chromosome Y (mLRR-Y) in the 56-Mb region between PAR1 and PAR2 on chromosome Y (chrY:2,694,521-59,034,049; hg19/GRCh37). An important consideration is the XTR that is shared between X and Y chromosome is also removed from the analysis because XTR can be affected by alterations in chromosome X, then redefining the mLRR-Y region as: chrY 6,611,498-24,510,581; hg19/GRCh37.
Assuming that mLRR values follow a symmetrical distribution across subjects, individuals are LOY-scored based on the threshold that is defined as the lower limit of the 99% confidence interval of the experimentally induced mLRR-Y variation. Forsberg et al. proposed to model the symmetrical distribution by reflecting the positive values of mLRR-Y over its median (method mLRR-Y thres ). LRR is computed as the ratio between two intensities and therefore its distribution is likely to be skewed, in addition the presence of large mLRR-Y values representing XYY gains cannot be fully discarded. Wright et al. (8), therefore proposed to use mLRR-Y as a continuous variable to measure the degree of mLOY given by its cellularity content (method mLRR-Y quant ). However, it is clear that mLRR-Y is a multimodal distribution and ignoring this feature reduces the power in association studies, as it has been seen when copy number variant status is estimated using continuous intensities.
As an alternative threshold method we proposed to use a robust estimation of the limit that determines a gain in chromosome Y (e.g XYY) that can be seen as an outlier of normal mLRR distribution. This value is estimated by median(mLRR) + 1.2 IQR(mLRR).
The three procedures MADloy based on LRR, mLRR-Y thres and mLRR-Y quant are implemented in the function MADloy of MADloy package.
To reduce the number false-positive calls, we propose to use combine the information obtained of mLRR-Y with the B-deviation values in X-Y homologous regions which include pseudoautosomal regions PAR1, PAR2, and XTR regions (12) whenever this data is available.
The reason to use B-deviation in addition to LRR is because chromosome Y alterations also alter B-deviation values in the X-Y homologous regions, due to the allelic imbalance between chromosomes X and Y. By setting a B-deviation threshold of 0.05 (we assume an expected BAF of heterozygous probes between 0.45 and 0.55) we can identify whenever the BAF is altered.
This information can be used as a quality control criteria by removing from the analysis those samples classified as LOY or GOY using mLRR-Y with normal values of B-deviation who are indicated as "discordant". This procedure is implemented in the function checkBdev of MADloy package. In addition, those samples without alterations in the mLRR-Y values but with alterations in B-deviation classified as "other", can provide evidences that other types of events can occur in the analyzed sample. We recommend to visually inspecting samples classified as "discordant" or "other", and only remove those cases having contaminations or other non-LOY events. To this end plots of chromosome X and Y are used as implemented in the functions plotIndSNPX and plotIndLRR functions.

Quality control
We perform a quality control of the samples involved in the analyses by removing those samples with large LRR variability to avoid additional variability likely due to technical artifacts.

Experimental validation
Ten random samples belonging to EGCUT cohort were validated using two Multiplex Ligation Probe-dependent Amplification (MLPA) panels, P070 (MRC-Holland Amsterdam, The Netherlands) covering all subtelomeric regions including the two pseudoautosomal regions (PAR1 and PAR2). Probes targeting Y chromosome, and a custom-made panel with probes for SRY and several autosomal loci, were used to assess the copy number status of chromosome Y with respect to the control loci (autosomal and X chromosome). The MLPA reactions were performed as previously described (17) with the some modifications for custom probe selection (18). We used the relative peak height (RPH) method recommended by MRC-Holland.

Statistical analyses
Association studies of mLOY with age and cancer performed in EGCUT and TCGA data was performed by using generalized linear models (Gaussian and binomial links, respectively).
Models with cancer in EGCUT data were adjusted for age to remove the effect of mLOY on aging. Transcriptome data belonging to EGCTU data were obtained using HTA 2.0 microarray.
Transcriptome of TCGA samples were obtained from RNAseq data available at RTCGA package.
Both analyses were performed using limma package and voom method was used to get continuous data from RNAseq experiments of TCGA (19). Models were adjusted for surrogate variables using sva package in order to control for possible experimental differences across samples (20). Enrichment analyses were carried out using GOstats Bioconductor library (21).

Simulation studies
We performed a number of simulations to determine the power of the association analyses considering mLOY status as continuous (Wright's method), as categorical (Forsberg's and MADloy methods). We considered two main scenarios to assess the power to detect significant associations with continuous (age) or categorical (case/control) outcomes. Data were simulated using functions from the CNVassoc package which is able to simulate gains and losses from continuous data (e.g. LRR) that can be considered as surrogate variable of LOY status in case-control and linear models (22). LRR continuous data was generated by using the mean and standard deviation of LRR in normal and LOY cases and the standard deviation of the outcome using the values observed from EGCUT data. Our aim was to mimic real situations when analyzing correlation between case-control status and mLOY or between age and mLOY. and all the individuals included in the analysis had a genotyping success rate above 95%.
Cryptic relatedness was tested with the PLINK v1.07 software. Only one of each detected relative pairs (up to second cousins) was randomly chosen for the detection of genetic mosaicism. Sample mix-ups were corrected using MixupMapper (24). All studies were performed in accordance with the ethical standards of the responsible committee on human experimentation, and with proper informed consent from all individuals tested. LRR and BAF were generated using GenomeStudio software.

Improved detection of mLOY status
We developed MADloy, a bioinformatics tool available as a Bioconductor package, which implements accurate mLOY calling of SNP intensity data using mLRR-Y and B-deviation were discarded due to large mLRR-Y variability), see Methods. Figure 1a shows the ploidycentered 5% trimmed mLRR-Y values for each subject with their respective calling status using MADloy. Figures 1b and 1c Figures 1b and 1c show the expected pattern of one sample with mLOY, for which mLRR-Y is lower than the ploidy value and a large BAF split is observed. The size of the split is proportional to the cellularity and is measured by B-deviation defined as the absolute difference between BAF values. The expected value of the B-deviation for heterozygous probes is 0.5 when no mLOY is present.
Using MADloy, we detected a total of 30 (4.3%) individuals with X-Y allelic imbalance consistent with decreased Y chromosome dosage (Figure 1a).  To increase the experimental support of MADloy calling, we validated by MLPA, the positive mLOY status, as detected by MADloy, of 10 randomly selected individuals from EGCUT. The results were fully concordant in all ten individuals. Figure 2A shows the LRR and BAF of probes across chromosomes Y and X for one individual. Figure 2B illustrates the validation of two mLOY cases (V32199 and V40611) with different cellularity (30% and 52%, respectively). We observed that for individuals with mLOY the relative peak heights (RPH) of Y probes were clearly reduced compared to the RPH of the X probe ( Figure 2B).
To address the differences in mLOY detection across different tissues, we asked the extent to which mLOY status in blood could be detected in buccal smear. 18 individuals from EGCUT, who were detected with mLOY using MADloy in blood, were randomly selected and recontacted after 3 years to assess the progression of mLOY in blood and evaluate its detection by MADloy in buccal smear ( Table 1). We observed that mLOY in blood persisted in time ( Table   1). Detected from BAF signals, the estimated proportion of cells with mLOY increased in 16 cases (88%) while it decreased in 2 (12%). In saliva, we discarded one individual for low quality of the sample and observed mLOY in only 7 individuals (41%). All cases of mLOY in saliva showed lower cellularity than mLOY in blood at followup. Using the same detection methods and procedures, we therefore confirmed that the detection power of mLOY in blood is substantially decreased if assessed in buccal smear.

mLOY improves the statistical power of association studies
Using a series of simulations we first compared MADloy performance with mLRR-Y thresh and mLRR-Y quant . The simulations were performed using an independent simulator for copy number variation, CNVassoc assuming that the distribution of normal and mLOY cases, with the proportions observed for in the EGCUT study, followed those observed for CNVs. We studied the power to detect true associations between mLOY and a quantitative trait. mLOY calling was performed with MADloy and Fosberg's methods and associations were also assessed with mLRR-Y as a quantitative variable (Supplementary Figure S2). We found that MADloy had the largest statistical power in all scenarios while mLRR-Y quant was the least powerful. In particular, we observed that an association of magnitude 2 per quantitative trait unit was detected with using MADloy with a power of 80% in 1,000 individuals while for mLRR-Y thresh a magnitude of >2.2 was required for the same power. We also observed similar results for dichotomous traits in case-control studies (Supplementary Figure S3). The differences in power were given by a lack of symmetry in the mLRR-Y distribution that resulted in misclassification by mLRR-Y thresh . In addition, the simulations confirmed that treating mLRR-Y as quantitative when the underlying distribution is a mixture is a suboptimal approach.
We then compared in real setting the associations with age and cancer risk with mLOY called with MADloy and mLRR-Y thresh in EGCUT. In this case we also aimed to evaluate the effect incorporating B-deviation signal in MADloy calling. We confirmed the association between mLOY and age. We found that mLOY called by MADloy with B-deviation showed the strongest and most significant association with age (beta=21.21, P=2.0x10 -10 ). However, when mLOY status was called using mLRR-Y information only, either by MADloy or by mLRR-Y thresh , the effect of association was lower ( Figure 3A). Note that the largest impact in the association was when considering mLRR-Y thresh method, suggesting again that the symmetry assumption is less accurate than the interquantile range to define the mLOY calling threshold. We  (Figure S4).
We then tested the association with cancer status. We observed again that strongest association was when mLOY was called with the full MADloy method (OR=10.6, P=4.65x10 -6 ).
Associations when mLOY was called with mLRR-Y thresh and were also significant but their Pvalues were higher in one order of magnitude (P=1.48x10 -5 and P=1.59x10 -5 , respectively), in line with previous results and giving further evidence that a more efficient use of the SNP intensity signals in mLOY calling can increment the power to detect associations with phenotypes ( Figure 3B).

Application of MADloy on finding transcriptomic signatures of mLOY
Differences in statistical power become crucial when studying the mechanisms underlying mLOY using genomic and transcriptomic data, as few misclassifications can lead to not detecting a relevant genome or transcriptome-wide association. We therefore apply MADloy to comparatively analyzed transcriptomic data in cells from individuals with and without mLOY in blood in the EGCUT study and in cancer tissues in the TCGA study.
We searched for transcription correlates with mLOY in blood using the EGCUT study. We

Discussion
Substantial differences in mLOY calling can arise from the different methods used leading to differences in power to detect an effect of mLOY on aging and cancer. We show here that significant improvements in the detection accuracy of mLOY status from SNP array intensity data can be obtained by using a common matrix-tissue and optimization of the calling methods, by improving the calling threshold in mLRR-Y signal and accounting individuals with chromosome Y gains. Previous methods assumed symmetry in the mLRR-Y distribution and included individuals with gains in the reference population, affecting the threshold to detect mLOY status (13). We observed that individuals with GOY are detectable. Therefore, not accounting for asymmetry in the distribution can increase mLOY false positives, reducing the statistical power in the associations due to misclassification. Interestingly we also observed that treating mLRR-Y as a continuous variable was more detrimental than positive (11). This is likely due to its multimodal nature of its distribution where individuals fall sharply into the normal or mLOY categories.
An additional gain in detection accuracy was observed by an independent analysis of the Bdeviation signal (18). While improvements were modest for the associations with age and cancer risk for mLOY in blood, we observed substantial increments in cancer tissues. The inclusion of this B-deviation as an additional signal in mLOY calling allowed us to detect instances where samples are likely contaminated and to confirm individuals with GOY. We have shown that samples classified as mLOY using mLRR-Y with discordant B-deviation values should be visually inspected to define contaminations or other non-LOY events. Removing these cases from the analyses in a quality control phase will reduce misclassification errors and improve statistical power.
It has been suggested that discrepancies in associations between mLOY and disease risk or mortality could also be related to the different tissue source of the DNA, despite significant correlation of mLOY calling in blood and buccal derived DNA from the same individuals (9). We show here that the cellularity of mLOY in blood tends to slightly increase with time, but the concordance with mLOY in buccal smear is only 41%, with lower cellularity in buccal samples in most cases. Therefore, different tissue source could lead to differences in mortality ratios and other associations to mLOY, as previously reported (9,11). On the other hand, the presence of mLOY across different tissues does suggest a common origin in each individual, consistent with a genetic predisposition that has also been documented by genome-wide association studies (8,7).
Interestingly, the transcriptomic associations of mLOY are quite different in tumor samples from non-tumor blood samples. In kidney and bladder tumors, extreme down-regulation of the entire Y is the main finding, a logical consequence of the decreased number of chromosome Y copies. However in blood, only two Y chromosome genes are among the most deregulated ones, and deregulation of autosomal genes mainly affects DNA replication, mismatch repair and homologous recombination pathways. This finding could be related to an increasingly oligoclonal leukocyte cell population in people with mLOY, which is consistent with the possibly compromised immune cell function of circulating leukocytes in people with mLOY as the risk factor for disease (13).
The mLOY is increasingly recognized as a male specific risk factor for multiple diseases and, therefore, its interest in personalized treatment and risk management is likely to increase in the coming years. Specific disease and mechanistic studies require a robust estimation of the mLOY status of individuals. SNP intensity data is already available in large epidemiological studies where the effects of environmental conditions can also be investigated. The ability to call mLOY reliably using these data will increase the reproducibility of the findings We show that MADloy is able to replicate the associations of mLOY in blood with cancer and aging while        Table S1: Discordant calls of EGCUT data sets. Data of some one case with discordant and other classification by MADloy are visually inspected in Figure  S3. Figure S1: Panels A and C show LRR and mLRR-Y of chromosome Y. Panels B and D show LRR and BAF of chromosome X of two samples from ECGUT dataset that were called as "discordant" (V39233) and "other" (V26411) MADloy. Sample V39233 (panels A and B) has been classified as "discordant" due to a decreased LRR (panel A, blue line) than expected (panel A, orange line) but no clear BAF split is visible in PAR1, PAR2 (panel C, Blue boxes) and XTR (panle C, orange box) in chromosome X. Sample V26411 (panels C and D) has been classified as "other" due to a decreased LRR(C, blue) than expected (panel C, orange line). However, by looking at chromosome X plot (panel D), it can be seen a 2-band BAF split caused by the presence of an XX sample contamination that will also decrease the expected chromosome Y dosage in the sample.  200, 300, 500, 750, 1000, 1500). The study assess the power (p-value<10 -5 ) to detect differences between cases an controls at different levels (odds ratios ranging from 1 up to 8). The scenarios and how data are simulated are described in the Methods Section of the main manuscript.