Chapter 2 A novel copy number variant algorithm
2.1 Introduction
In human genetics, individuals normally have two copies of each locus in the genome (one inherited from each parent). Deviations from the normal diploid state, known broadly as copy number variation, can cause phenotypic changes and Mendelian disorders. Technologies, e.g. microarray, exist for reliably detecting large (greater than 100 kilobases) copy number variants (CNVs). Over the last decade, the availability short-read DNA sequencing compelled numerous efforts to identify and characterize smaller variants. Sequencing cost, data burden, and the problem of classifying intronic and non-coding variants have led to exome sequencing (ES) as the preferred clinical sequencing modality. ES analysis most often focuses on identifying pathogenic single-nucleotide variants and insertion/deletions. CNV analysis has demonstrated limited improvement in diagnostic yield,85 but existing data/analysis lacks power to detect exon-level variation.86,87
Current analytic methodologies adequately detect large CNVs, but require large amounts of data and lack resolution for intragenic exon-level variation.46,88–90 The prevalence and clinical importance of exon-level CNVs remains largely unknown due to inadequate power in ES studies and limited access to clinical genome sequencing data. Recent work on a subset of 1507 genes suggests intragenic CNVs account for 1.9% of total variants, but 9.8% of pathogenic variants.91 Additionally, the authors demonstrated 627/2844 (22%) of identified CNVs spanned a single (598) or partial (29) exon.91
Targeted sequencing requires capturing the desired loci (e.g. exons) using sequence-specific oligonucleotide baits. The differential efficiency of baits, even when carefully designed and balanced, leads to variable read-depth across the exome. The GC content and length of targeted fragments both contribute to the observed variable read-depth;92 most ES analysis platforms incorporate correction for GC content and exon length.93 The variable read-depth in ES precludes the single-sample window-smoothing approaches successfully applied in GS data,94 therefore we must rely on comparative analysis for interrogating copy number.
Comparing multiple samples, each captured independently, compounds the variable read-depth problem. The capture probability for each exon correlates between samples but with high variability.46 In other words, we can gain information from similarly captured samples, but independent captures introduce significant noise. ExomeDepth attempts to circumvent the capture-to-capture variation by identifying a subset of samples from a large pool with low inter-sample variability.46 Alternatively, CoNIFER,90 XHMM,88 and CODEX89 use a latent factor model with spectral value decomposition to remove systematic noise, presumably introduced by capture-to-capture variation. These methods generally require very large sample sizes, and often still lack power for exon-level resolution (e.g. CODEX defines a “short” CNV as spanning 5 contiguous exons).
Here, we explore how multiplexing the capture across samples reduces inter-sample variance, increasing the power to detect CNVs. We also introduce our own algorithm, mcCNV (“multiplexed capture CNV”), specifically designed to utilize multiplexed capture exome data for estimating exon-level variation without prior information.
2.2 Methods
2.2.1 Exome sequencing
We performed sequencing on human samples of purified DNA obtained from the Wilhelmsen laboratory collection, the NCGENES cohort,95 and the Coriell Institute in compliance with the UNC Institutional Review Board. We also utilized existing read-level data from the NCGENES95 project. We compared the performance of two capture platforms: (1) Agilent SureSelect XT2 (multiplexed capture)/Agilent SureSelect XT (independent capture); (2) Integrated DNA Technologies (IDT) xGen Lockdown Probes. We utilized Human All Exome v4 baits (Agilent) and Exome Research Panel v1 baits (IDT). All captures performed according to manufacturer protocol, with the following exceptions: (1) we multiplexed 16 samples versus the recommended 8 for the XT2 protocol for some pools; (2) for Pool2, we performed the fragmentation step 5 times, to test whether a more uniform fragment length distribution would improve capture.
All sequencing performed with Illumina paired-end chemistry. We aligned paired reads to hg19v0 (GATK resource bundle) using BWA-MEM96 and removed duplicate reads using Picard tools. We then used our novel R package, mcCNV, to count the number of overlapping molecules (read-pairs) per exon. For inclusion, we required properly-paired molecules with unambiguous mapping for one read and mapping quality greater than or equal to 20 for both reads. Full Snakemake97 pipeline provided in supplemental materials. Table 2.1 provides an overview of the exome sequencing included.
2.2.2 Genome sequencing
For the 16 samples in the “WGS” pool, we performed genome sequencing to an average 50x coverage. We followed Trost et al. recommendations for making read-depth based CNV calls.98 Briefly, we mapped paired-reads identical to our targeted sequencing data. We then interrogated the read depth interquartile range using samtools depth,29 recalibrated base-quality scores and called sequence variants using GATK,40 and called copy number variants using the ERDS49 and cnvpytor (updated implementation of CNVnator)45 algorithms. Full Snakemake97 pipeline provided in supplemental materials.
2.2.3 Simulating targeted sequencing
To simulate targeted capture, we represent the capture process as a large multinomial distribution defining the probability of capture at each target. We use an alternate definition of copy state, such that 1 represents the normal diploid state. Let \(N\) represent the total number of molecules (read pairs) and \(e_j \in \mathbb{E}\) represent the probability of capturing target \(j\), then for each subject, \(i\):
Randomly select \(s_{ij} \in \mathbb{S}_i\) from \(S = \{0.0, 0.5, 1, 1.5, 2\}\) as the copy number at target \(j\)
Adjust the subject-specific capture probabilities by the copy number, \(\mathbb{E}_i = \frac{\mathbb{E} \odot \mathbb{S}_{i}}{\sum_j \mathbb{E} \odot \mathbb{S}_{i}}\)
Draw \(N\) times from \(\text{Multinomial}(\mathbb{E}_i)\), giving the molecule counts at each target \(j\) for sample \(i\), \(c_{ij} \in \mathbb{C}_i\)
We provide functionality within the mcCNV R package for producing reproducible simulations.
2.2.4 mcCNV algorithm
The mcCNV algorithm was adapted from the sSEQ method for quantifying differential expression in RNA-seq experiments with small sample sizes.99 Yu et al. provide detailed theoretical background of the negative binomial model and using shrinkage to improve dispersion estimates. The mcCNV algorithm adjusts the sSEQ probability model by adding a multiplier for the copy state: \[ C_{ij} \sim \mathcal{NB}(f_is_{ij}\hat\mu_j, \tilde\phi_j/f_i) \] where the random variable \(C_{ij}\) represents observed molecule counts for subject \(i\) at target \(j\), \(f_i\) is the size factor for subject \(i\), \(s_{ij}\) is the copy state, \(\mu_j\) is the expected mean under the diploid state at target \(j\), and \(\tilde\phi_j\) is the shrunken phi at target \(j\). We observe \(c_{ij}\) and wish to estimate \(s_{ij}\), \(\hat{s}_{ij}\). Initialize by setting \(\hat{s}_{ij} = 1\) for all \(i,j\). Then,
Adjust the observed values for the estimated copy-state, \[ c_{ij}^{\prime} = \frac{c_{ij}}{\hat{s}_{ij}}. \]
Subset \(c_{ij}^{\prime}\) such that \(c_{ij}^{\prime} > 10, ~ \hat{s}_{ij} > 0\)
Calculate the size-factor for each subject \[ f_i = \text{median}\left(\frac{c_{ij}^{\prime}}{g_j}\right), \] where \(g_j\) is the geometric mean at each exon.
Use method of moments to calculate the expected dispersion \[ \hat\phi_j = \max\left(0, \frac{\hat\sigma_j^2 - \hat{\mu}_j}{\hat{\mu}_j^2}\right) \] where \(\hat{\mu}_j\) and \(\hat{\sigma}_j^2\) are the sample mean and variance of \(c_{ij}^{\prime}/f_i\).
Let \(J\) represent the number of targets. Shrink the phi values to \[ \tilde\phi_j = (1 - \delta)\hat\phi_j + \delta\hat{\xi} \] such that \[ \delta = \frac{\sum\limits_j\left(\hat\phi_j - \frac{1}{n_j}\sum\limits_j \hat\phi_j\right)^2/(J - 1)} {\sum\limits_j\left(\hat\phi_j - \hat{\xi}\right)^2/(n_j - 2)} \] and \[ \hat{\xi} = \mathop{\text{argmin}}\limits_{\xi}\left\{ \frac{d}{d\xi}\frac{1}{\sum\limits_j \left(\hat\phi_j - \xi\right)^2} \right\}. \]
Update \(\hat{s}_{ij}\), \[ \mathop{\text{argmax}}\limits_{s \in S}\left\{ \mathcal{L}(s \rvert c_{ij},f_i,\hat\mu_j,\tilde\phi_j) \right\} \] where \(S = \{0.001, 0.5, 1, 1.5, 2\}\).
Repeat until the number of changed states falls below a threshold or a maximum number of iterations is reached.
After convergence, calculate p-values for the diploid state, \(\pi_{ij} = \text{Pr}(s_{ij} = 1)\).
Adjust p-values using the Benjamini–Hochberg procedure100 and filter to a final call-set such that adjusted p-values fall below some threshold, \(\alpha\).
2.3 Results
2.3.1 Multiplexed capture reduces inter-sample variance
ES requires using molecular baits to “capture” the exonic DNA fragments during the library preparation (prior to sequencing). Most laboratories capture each sample individually. The capture efficiency varies with timing, temperature, and substrate concentrations, making identical capture reproduction impossible. Alternatively, one could multiplex (pool) samples prior to capture, capturing the pool of samples simultaneously. Here we profile the inter-sample variance of individual capture versus multiplexed capture.
A multinomial process provides a logical framework for modeling targeted capture, each target represented by an individual outcome. We can estimate the multinomial probability simplex for an exome capture by dividing the observed counts at each exon by the total mapped reads for the exome. The dirichlet distribution, conjugate prior to the multinomial, defines distributions of probability simplexes. The dirichlet distribution is parameterized by \(\boldsymbol{\alpha} = \{\alpha_1, \alpha_2, \dots, \alpha_n\}\), where the expected probability for outcome \(i\) is given by \(\alpha_i/\alpha_0,~\alpha_0 = \sum \boldsymbol\alpha\). If \(\boldsymbol\pi\) is a probability simplex drawn from a dirichlet with \(\boldsymbol\alpha\), then the variance of \(\boldsymbol\pi\) is inversely proportional to \(\alpha_0\). Therefore, we can approximate the inter-sample variance by fitting the dirichlet distribution to each pool and interrogating the mean \(\alpha\).
pool | capture | N | medExon | medTotal | minTotal | maxTotal | rsdTotal |
---|---|---|---|---|---|---|---|
IDT-IC\(^\dagger\) | IDT | 16 | 143 | 55,149,058 | 37,453,015 | 85,138,915 | 22.4 |
IDT-MC | IDT | 16 | 93 | 29,772,684 | 16,674,468 | 118,147,912 | 64.2 |
IDT-RR | IDT | 16 | 272 | 79,079,629 | 61,289,322 | 120,147,888 | 22.9 |
NCGENES\(^\dagger\) | Agilent | 112 | 93 | 24,451,245 | 12,749,793 | 68,565,471 | 27.6 |
Pool1 | Agilent | 16 | 56 | 13,265,614 | 8,911,132 | 17,324,903 | 18.5 |
Pool2 | Agilent | 16 | 86 | 21,076,056 | 4,585,195 | 27,846,146 | 27.6 |
SMA1 | Agilent | 8 | 56 | 12,256,002 | 11,051,840 | 13,600,697 | 6.2 |
SMA2 | Agilent | 8 | 25 | 5,622,040 | 4,904,000 | 6,545,360 | 10.4 |
WGS | Agilent | 16 | 196 | 46,406,224 | 36,496,097 | 65,200,410 | 16.4 |
Using multiplexed capture, we sequenced 3 16-sample pools and 2 8-sample pools with Agilent baits and 2 16-sample pools with IDT baits (Table 2.1). To compare to individually-captured Agilent data, we randomly-selected 5 16-sample pools from the NCGENES cohort. For numeric stability, we subset to exons with at least 5 and no greater than 2000 counts across all samples within a pool. We then used a Newton-Raphson algorithm101 to fit the dirichlet distribution to each pool; all pools converged to stable estimates. We found, with one exception, multiplexed capture pools had greater \(\alpha_0\) of their independently-captured counterparts (Figure 2.1).

Figure 2.1: Multiplexed capture (MC) decreases variance with respect to independent captures (IC), as estimated by fitting the dirichlet distribution. Total counts/sample given on the horizontal axis; mean \(\alpha\) given on the vertical axis. \(\alpha_0\) is inversely proportional to inter-sample variance. Each line/point represents a single pool. The point indicates the median total counts across the pool, with the range given by the line. Orange indicates a multiplexed capture; blue indicates independent captures. Triangles indicate pools using Agilent (AGL) capture; squares indicate Integrated DNA Technologies (IDT).
The multiplexed pool without decreased inter-sample variance, IDT-MC, had a much larger spread in sequencing depth across the pool (Table 2.1, Figure 2.1). Looking at the total mapped molecules, the IDT-MC pool had a relative standard deviation of 64.2%, over double the next highest pool. We hypothesized the absent reduction in variation stemmed from poor library balance during the multiplexing step. We subsequently captured a new pool using the same DNA input, IDT-RR, and found comparable reductions in inter-sample variance (the pool with the highest \(\alpha_0\) in Figure 2.1).
Examining the mean-variance relationship demonstrated the same inter-sample variance reduction suggested by the dirichlet parameter estimates (Figures 2.2 and 2.3). The Agilent pools (Figure 2.2) segregated cleanly, with less dispersion in the multiplexed capture pools. Again, we found no variance reduction for the IDT-MC pool, overlapping with independently-captured IDT-IC pool (Figure 2.2). We did, however, observe near-complete reduction in dispersion for the better-balanced IDT-RR pool.

Figure 2.2: Mean-variance relationship for Agilent (AGL) pools. Mean counts per exon given on the horizontal axis; mean variance per exon given on the vertical axis. Contours show the distribution of points by pool. Dotted lines show the ordinary least squares regression fit. Orange indicates multiplexed capture pools; blue indicates independently captured pools. The dashed gray line represents the 1:1 relationship expected under a Poisson process. Lines above the plot show the density of mean values by pool; lines to the right of the plot show the density of variance values by pool.

Figure 2.3: Mean-variance relationship for Integrated DNA Technologies (IDT) pools. Mean counts per exon given on the horizontal axis; mean variance per exon given on the vertical axis. Contours show the distribution of points by pool. Dotted lines show the ordinary least squares regression fit. Orange indicates multiplexed capture pools; blue indicates independently captured pools. The dashed gray line represents the 1:1 relationship expected under a Poisson process. Lines above the plot show the density of mean values by pool; lines to the right of the plot show the density of variance values by pool.

Figure 2.4: Comparison of mean-variance relationship between WGS pool (blue) and IDT-RR pool (orange). Mean count by exon given on horizontal axis; variance of exon counts given on horizontal axis. Dotted lines show the ordinary least-squares fit. Lines above plot show the distribution of mean values; lines to the right of the plot show the distribution of variance values.
2.3.2 Multiplexed capture provides controls for ExomeDepth
ExomeDepth requires a set of control subjects, summed into a reference vector of counts at each exon. ExomeDepth provides functionality to select appropriate controls from a set of subjects, often requiring large numbers of subjects to identify appropriate controls. Smaller research groups and clinical laboratories may struggle building large databases of exomes, with the difficulty compounded by lot-to-lot variation and regular improvements to capture and sequencing chemistries. We wanted to know if the reduced inter-sample variance with multiplexed capture could provide an appropriate control set for ExomeDepth, eliminating the need for large databases of similarly-captured exomes. We found the reduced inter-sample variance with multiplexed capture leads to appropriate control selection for ExomeDepth (Figures 2.5). Pool2, where we repeated the initial fragmentation 5 times, did not perform as well as the other multiplexed pools. We also found two samples within the WGS pool did not correlate well with the rest of the pool.

Figure 2.5: Median count per exon. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Figure 2.6: Total number of controls selected by ExomeDepth. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Figure 2.7: Proportion of available samples selected by ExomeDepth as a control. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Figure 2.8: Estimated phi parameter from ExomeDepth. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.
When we looked at independently-captured subjects, we found appropriate control sets for most of the 112 NCGENES subjects (Figure 2.6). However, ExomeDepth only selected 12.2% of available samples as controls, on average (Figure 2.7). Similarly, with the independently-captured IDT-IC pool we find low control numbers for most samples. While possible to select the same number of controls but exhibit differing dispersion, we observed little difference in the dispersion between independent and multiplexed capture (Figure 2.8). Overall, multiplexed capture provided appropriate controls for most samples tested, however an adequately-large set of available controls delivered comparable performance.
2.3.3 mcCNV & ExomeDepth perform comparably in simulation study
To compare our mcCNV algorithm and ExomeDepth, we created synthetic pools of data across different sequencing depths. Based on our observations with the real data, we selected the total number of molecules for each sample from a uniform distribution defined as a 30% window on either side of the specified depth; for example, for a specified depth of 10 million molecules, we drew the molecules per sample from 7 to 13 million molecules. For each depth ranging from 5 to 100 million molecules, we simulated 200 16-sample pools with single-exon variants. We allowed for homozygous and heterozygous deletions and duplications (0 to 4 copies), such that all variants were equally likely and the total variant probability was \(1/1,000\). We used, as the starting capture probabilities (\(\mathbb{E}\)), the empiric capture probabilities observed by summing across the Pool1 pool.
We analyzed each of the 4,000 pools (200 replicates by 20 depths) using our algorithm and two iterations of ExomeDepth. For the first iteration of ExomeDepth, we used the default values for transition probability (\(1/10,000\)) and expected variant length (50 kb). For the second iteration, we used the true simulated variant prior for the transition probability (\(1/1,000\)) and an expected variant length of 1 kb. As expected, the sensitivity increased and false discovery rate decreased as the sequencing depth increased (Figures 2.9). In both comparisons, mcCNV demonstrated a superior false-discovery rate. When interrogating Matthew’s correlation coefficient102 and the sensitivity, we found mcCNV had marginal performance over ExomeDepth with default parameters and marginal performance under ExomeDepth with simulation-matched parameters. Table 2.2 provides the actual values.






Figure 2.9: Algorithm performance comparing mcCNV and ExomeDepth on simulated exomes. (A-C) mcCNV versus ExomeDepth with default parameters, \(1/10,000\) transition probability and 50 kb expected variant length. (D-F) mcCNV versus ExomeDepth with simulation-matched parameters, \(1/1,000\) transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthew’s correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship.
MCC
|
TPR
|
FDR
|
|||||||
---|---|---|---|---|---|---|---|---|---|
dep | mcCNV | ED-def | ED-sim | mcCNV | ED-def | ED-sim | mcCNV | ED-def | ED-sim |
5 | 0.713 | 0.401 | 0.519 | 0.522 | 0.192 | 0.298 | 0.02230 | 0.15900 | 0.09260 |
10 | 0.694 | 0.628 | 0.708 | 0.503 | 0.431 | 0.549 | 0.04250 | 0.08450 | 0.08590 |
15 | 0.781 | 0.742 | 0.801 | 0.627 | 0.581 | 0.690 | 0.02600 | 0.05270 | 0.06940 |
20 | 0.840 | 0.811 | 0.857 | 0.719 | 0.682 | 0.777 | 0.01810 | 0.03420 | 0.05360 |
25 | 0.879 | 0.856 | 0.893 | 0.783 | 0.752 | 0.832 | 0.01310 | 0.02460 | 0.04090 |
30 | 0.907 | 0.889 | 0.918 | 0.831 | 0.804 | 0.872 | 0.00967 | 0.01750 | 0.03210 |
35 | 0.926 | 0.909 | 0.935 | 0.864 | 0.839 | 0.897 | 0.00807 | 0.01370 | 0.02600 |
40 | 0.941 | 0.927 | 0.948 | 0.891 | 0.869 | 0.917 | 0.00638 | 0.01060 | 0.02080 |
45 | 0.952 | 0.940 | 0.957 | 0.911 | 0.892 | 0.932 | 0.00527 | 0.00846 | 0.01680 |
50 | 0.961 | 0.950 | 0.965 | 0.927 | 0.910 | 0.944 | 0.00437 | 0.00701 | 0.01370 |
55 | 0.966 | 0.957 | 0.969 | 0.937 | 0.921 | 0.951 | 0.00377 | 0.00569 | 0.01180 |
60 | 0.972 | 0.963 | 0.974 | 0.947 | 0.933 | 0.959 | 0.00318 | 0.00517 | 0.00986 |
65 | 0.976 | 0.969 | 0.978 | 0.955 | 0.943 | 0.964 | 0.00290 | 0.00433 | 0.00837 |
70 | 0.978 | 0.972 | 0.980 | 0.960 | 0.949 | 0.968 | 0.00252 | 0.00381 | 0.00735 |
75 | 0.981 | 0.976 | 0.983 | 0.965 | 0.955 | 0.972 | 0.00212 | 0.00321 | 0.00625 |
80 | 0.983 | 0.978 | 0.985 | 0.969 | 0.960 | 0.975 | 0.00200 | 0.00294 | 0.00560 |
85 | 0.985 | 0.980 | 0.986 | 0.972 | 0.963 | 0.977 | 0.00181 | 0.00263 | 0.00491 |
90 | 0.987 | 0.982 | 0.987 | 0.975 | 0.967 | 0.979 | 0.00169 | 0.00243 | 0.00451 |
95 | 0.988 | 0.984 | 0.988 | 0.978 | 0.970 | 0.981 | 0.00156 | 0.00223 | 0.00393 |
100 | 0.989 | 0.985 | 0.989 | 0.980 | 0.973 | 0.982 | 0.00150 | 0.00195 | 0.00359 |
2.3.4 mcCNV & ExomeDepth perform comparably on WGS pool
To establish a truth set on real data, we performed matched genome sequencing on the subjects included in the WGS pool. Following the best practices suggested by Trost et al.,98 we performed read-depth based CNV calling using the genome data. In line with recommendations by Trost et al., we excluded from comparative analysis any exons overlapping repetitive or low-complexity regions (34,856 out of 179,250). We then compared the exome calls using mcCNV and ExomeDepth to the genome calls using the overlap of ERDS and cnvpytor. Table 2.3 lists the total calls by subject. Overall, mcCNV predicted the largest number of variants; however, 85.7% of predicted variants were deletions from two samples (NCG_00790 and NCG_00851). ExomeDepth also predicted a disproportionate number of deletions for NCG_00790 and NCG_00851, totaling 69.4% of calls.
Total
|
Duplications
|
Deletions
|
|||||||
---|---|---|---|---|---|---|---|---|---|
subject | MC | ED | WG | MC | ED | WG | MC | ED | WG |
NCG_00012 | 90 | 106 | 143 | 61 | 73 | 121 | 29 | 33 | 22 |
NCG_00237 | 82 | 101 | 165 | 50 | 64 | 129 | 32 | 37 | 36 |
NCG_00525 | 68 | 74 | 151 | 30 | 33 | 110 | 38 | 41 | 41 |
NCG_00593 | 45 | 58 | 142 | 22 | 28 | 81 | 23 | 30 | 61 |
NCG_00676 | 66 | 78 | 112 | 38 | 46 | 92 | 28 | 32 | 20 |
NCG_00790 | 5,156 | 2,204 | 121 | 19 | 37 | 92 | 5,137 | 2,167 | 29 |
NCG_00819 | 68 | 76 | 134 | 30 | 41 | 100 | 38 | 35 | 34 |
NCG_00840 | 78 | 92 | 157 | 44 | 52 | 115 | 34 | 40 | 42 |
NCG_00851 | 1,151 | 859 | 141 | 28 | 51 | 102 | 1,123 | 808 | 39 |
NCG_00857 | 59 | 75 | 119 | 10 | 15 | 81 | 49 | 60 | 38 |
NCG_00976 | 46 | 58 | 114 | 25 | 37 | 93 | 21 | 21 | 21 |
NCG_01023 | 59 | 95 | 143 | 32 | 60 | 113 | 27 | 35 | 30 |
NCG_01043 | 73 | 94 | 128 | 40 | 64 | 105 | 33 | 30 | 23 |
NCG_01076 | 36 | 57 | 105 | 7 | 22 | 78 | 29 | 35 | 27 |
NCG_01077 | 135 | 157 | 230 | 103 | 121 | 184 | 32 | 36 | 46 |
NCG_01117 | 95 | 101 | 154 | 72 | 78 | 129 | 23 | 23 | 25 |
Looking at the control selection, for NCG_00790 and NCG_00851 ExomeDepth only selected 2 and 3 controls, respectively. Furthermore, NCG_00790 and NCG_00851 had substantially higher dispersion than the rest of the pool (two outliers in Figure 2.8).
Recognizing the genome calls do not represent an accurate truth set, we looked at the ability of mcCNV and ExomeDepth to predict the genome calls. Due to the large number of deletions called for NCG_00790 and NCG_00851, both algorithms performed poorly in predicting the genome calls (Table 2.4). When we excluded NCG_00790 and NCG_00851 from the analysis, mcCNV had comparable, but uniformly better performance. Both algorithms demonstrated greater power to detect deletions. Figures 2.10 and 2.11 show the call overlap between the three approaches, including and excluding NCG_00790 and NCG_00851, respectively.
MCC | TPR | FDR | PPV | |||
---|---|---|---|---|---|---|
ALL | Total | MC | 0.185 | 0.335 | 0.897 | 0.1030 |
ED | 0.263 | 0.363 | 0.809 | 0.1910 | ||
Sub | MC | 0.487 | 0.345 | 0.311 | 0.6890 | |
ED | 0.482 | 0.378 | 0.383 | 0.6170 | ||
DUP | Total | MC | 0.396 | 0.236 | 0.334 | 0.6660 |
ED | 0.347 | 0.240 | 0.496 | 0.5040 | ||
Sub | MC | 0.404 | 0.246 | 0.333 | 0.6670 | |
ED | 0.384 | 0.266 | 0.446 | 0.5540 | ||
DEL | Total | MC | 0.180 | 0.639 | 0.949 | 0.0509 |
ED | 0.219 | 0.558 | 0.914 | 0.0861 | ||
Sub | MC | 0.683 | 0.661 | 0.294 | 0.7060 | |
ED | 0.541 | 0.554 | 0.471 | 0.5290 |


Figure 2.10: Copy number variant call concordance for the WGS pool. (A) predicted duplications; (B) predicted deletions. mcCNV in grey; ExomeDepth in blue; ERDS/cnvpytor in orange. Values within overlaps give the number of variants.


Figure 2.11: Copy number variant call concordance for the WGS pool, excluding subjects NCG_00790 and NCG_00851 due to poor correlation to the rest of the pool. (A) predicted duplications; (B) predicted deletions. mcCNV in grey; ExomeDepth in blue; ERDS/cnvpytor in orange. Values within overlaps give the number of variants.
2.4 Discussion
The medical genetics community still lacks robust exome-wide information about small (exon-level) variant prevalence. Others have established the reliability and cost-efficiency of pre-capture multiplexing,103–107 and most commercial exome capture platforms have protocols for pre-capture multiplexing. Here, we demonstrate the reduction in inter-sample variance when pre-capture multiplexing, leading to increased power to detect exon-level copy number variation. Despite the benefits, many clinical laboratories do not employ a multiplexed capture protocol because multiplexing requires waiting to fill a pool and may delay results. While we understand the increased complexity, multiplexed capture may uncover otherwise missed copy number variation and increase the diagnostic yield for patients.
Multiplexed capture is not without limitations. We presented an example (pool IDT-MC) where multiplexed capture provided little to no improvement over independently-captured samples. We concluded the absent improvement in inter-sample variance stemmed from the poor library balance prior to capture. Rebuilding a more-balanced pool with the same samples (pool IDT-RR) demonstrated a large reduction in inter-sample variance.
In assessing the inter-sample variance, we compared two capture platforms: (1) Agilent SureSelectXT2 and (2) Integrated DNA Technologies xGen Lockdown Probes. We do not have enough data to suggest definitively one over the other. Comparing the mean-variance relationship, the IDT-RR pool appeared to have less dispersion overall (Figure 2.4); however, the sample-specific dispersion estimates from ExomeDepth suggest better performance by the WGS pool (Figure 2.8) and the higher pool-wide dispersion comes entirely from the two poorly correlated samples.
Our results suggest having a sufficiently large database of samples most-often provides appropriate control samples to estimate copy number variation (Figure 2.5). However, we show laboratories can circumvent the need for large samples by multiplexing the capture step. Defining the capture pool as the set of controls both limits the need for regular reanalysis as the database grows and eliminates potential over-selecting of samples with the same variants.
At the depth of the WGS pool, our simplistic simulation study would suggest both mcCNV and ExomeDepth have the power to detect single-exon variants with >85% sensitivity while maintaining a low false-discovery rate (Figure 2.9, Table 2.2). However, comparing the exome calls to the genome calls for the WGS pool revealed lackluster concordance. As Trost et al. point out, the genome CNV callers still struggle with variants less than 1 kb.98 We do not dismiss the possibility of exome calls providing greater reliability than the genome calls, given multiplexed capture and adequate sequencing depth. However, given the distribution of calls throughout the exome, we doubt the thousands of excess deletions called for NCG_00790 and NCG_00851. Confirmation of the individual calls is beyond the scope of this work.
Unsurprisingly, both mcCNV and ExomeDepth failed to call many of the duplications called from the genome data. The variance for the negative binomial increases as the mean increases; we expect greater variation in read depth from duplicated loci, making duplications more difficult to distinguish. Similarly, the variance of the binomial proportion increases monotonically over [0, 0.5). More sensitive detection of duplications will likely require greater sequencing depth.
The simulation study emphasizes the importance of sequencing depth (in terms of absolute molecules). We can collect increased basepair coverage for less money by sequencing longer reads (e.g. 2x150 versus 2x50), but doing so decreases power for depth-based CNV calling. Typically, exome sequencing targets 30-50x coverage to ensure most targets have sufficient coverage for accurate basepair calling. We demonstrate the need for much deeper sequencing if we wish to establish exon-level variants.
Taken together, we recommend the following: (1) research and clinical endeavors consider adjusting protocols to multiplex samples prior to any targeted capture; (2) prior to capture, we suggest checking the library balance and adjusting as necessary (we achieved reasonable performance with relative standard deviation values less than 25%); (3) collecting an average of 225 filtered read-pairs per target. We then provide a simple-to-use and efficient R package to estimate copy number utilizing the negative bionimal distribution.
We believe the uncertainty about the prevalence and clinical significance of exon-level variants warrants a large undertaking. Even if we take the conservative approach and looking only at concordant calls between genome and exome sequencing (Figure 2.11), we have an average of 40 variants per sample to contend with. Two possibilities exist: (1) the algorithms all fail over specific regions, or (2) some genes can tolerate intrageneic copy-number variation better than others. Having eliminated calls from repetitive and low-complexity regions, we believe possibility (2) is more likely. To truly determine the prevalence (and therefore, clinical significance) of exon-level variants we need to interrogate exon-level variants on a large cohort. Confirmation testing for the tens to thousands of predicted variants from the exome and genome calls would allow true determination of algorithm performance and inform the clinical utility.