Chapter 2 A novel copy number variant algorithm

2.1 Introduction

In human genetics, individuals normally have two copies of each locus in the genome (one inherited from each parent). Deviations from the normal diploid state, known broadly as copy number variation, can cause phenotypic changes and Mendelian disorders. Technologies, e.g. microarray, exist for reliably detecting large (greater than 100 kilobases) copy number variants (CNVs). Over the last decade, the availability short-read DNA sequencing compelled numerous efforts to identify and characterize smaller variants. Sequencing cost, data burden, and the problem of classifying intronic and non-coding variants have led to exome sequencing (ES) as the preferred clinical sequencing modality. ES analysis most often focuses on identifying pathogenic single-nucleotide variants and insertion/deletions. CNV analysis has demonstrated limited improvement in diagnostic yield,⁸⁵ but existing data/analysis lacks power to detect exon-level variation.^86,87

Current analytic methodologies adequately detect large CNVs, but require large amounts of data and lack resolution for intragenic exon-level variation.^46,88–90 The prevalence and clinical importance of exon-level CNVs remains largely unknown due to inadequate power in ES studies and limited access to clinical genome sequencing data. Recent work on a subset of 1507 genes suggests intragenic CNVs account for 1.9% of total variants, but 9.8% of pathogenic variants.⁹¹ Additionally, the authors demonstrated 627/2844 (22%) of identified CNVs spanned a single (598) or partial (29) exon.⁹¹

Targeted sequencing requires capturing the desired loci (e.g. exons) using sequence-specific oligonucleotide baits. The differential efficiency of baits, even when carefully designed and balanced, leads to variable read-depth across the exome. The GC content and length of targeted fragments both contribute to the observed variable read-depth;⁹² most ES analysis platforms incorporate correction for GC content and exon length.⁹³ The variable read-depth in ES precludes the single-sample window-smoothing approaches successfully applied in GS data,⁹⁴ therefore we must rely on comparative analysis for interrogating copy number.

Comparing multiple samples, each captured independently, compounds the variable read-depth problem. The capture probability for each exon correlates between samples but with high variability.⁴⁶ In other words, we can gain information from similarly captured samples, but independent captures introduce significant noise. ExomeDepth attempts to circumvent the capture-to-capture variation by identifying a subset of samples from a large pool with low inter-sample variability.⁴⁶ Alternatively, CoNIFER,⁹⁰ XHMM,⁸⁸ and CODEX⁸⁹ use a latent factor model with spectral value decomposition to remove systematic noise, presumably introduced by capture-to-capture variation. These methods generally require very large sample sizes, and often still lack power for exon-level resolution (e.g. CODEX defines a “short” CNV as spanning 5 contiguous exons).

Here, we explore how multiplexing the capture across samples reduces inter-sample variance, increasing the power to detect CNVs. We also introduce our own algorithm, mcCNV (“multiplexed capture CNV”), specifically designed to utilize multiplexed capture exome data for estimating exon-level variation without prior information.

2.2 Methods

2.2.1 Exome sequencing

We performed sequencing on human samples of purified DNA obtained from the Wilhelmsen laboratory collection, the NCGENES cohort,⁹⁵ and the Coriell Institute in compliance with the UNC Institutional Review Board. We also utilized existing read-level data from the NCGENES⁹⁵ project. We compared the performance of two capture platforms: (1) Agilent SureSelect XT2 (multiplexed capture)/Agilent SureSelect XT (independent capture); (2) Integrated DNA Technologies (IDT) xGen Lockdown Probes. We utilized Human All Exome v4 baits (Agilent) and Exome Research Panel v1 baits (IDT). All captures performed according to manufacturer protocol, with the following exceptions: (1) we multiplexed 16 samples versus the recommended 8 for the XT2 protocol for some pools; (2) for Pool2, we performed the fragmentation step 5 times, to test whether a more uniform fragment length distribution would improve capture.

All sequencing performed with Illumina paired-end chemistry. We aligned paired reads to hg19v0 (GATK resource bundle) using BWA-MEM⁹⁶ and removed duplicate reads using Picard tools. We then used our novel R package, mcCNV, to count the number of overlapping molecules (read-pairs) per exon. For inclusion, we required properly-paired molecules with unambiguous mapping for one read and mapping quality greater than or equal to 20 for both reads. Full Snakemake⁹⁷ pipeline provided in supplemental materials. Table 2.1 provides an overview of the exome sequencing included.

2.2.2 Genome sequencing

For the 16 samples in the “WGS” pool, we performed genome sequencing to an average 50x coverage. We followed Trost et al. recommendations for making read-depth based CNV calls.⁹⁸ Briefly, we mapped paired-reads identical to our targeted sequencing data. We then interrogated the read depth interquartile range using samtools depth,²⁹ recalibrated base-quality scores and called sequence variants using GATK,⁴⁰ and called copy number variants using the ERDS⁴⁹ and cnvpytor (updated implementation of CNVnator)⁴⁵ algorithms. Full Snakemake⁹⁷ pipeline provided in supplemental materials.

2.2.3 Simulating targeted sequencing

To simulate targeted capture, we represent the capture process as a large multinomial distribution defining the probability of capture at each target. We use an alternate definition of copy state, such that 1 represents the normal diploid state. Let $N$ represent the total number of molecules (read pairs) and $e_j \in \mathbb{E}$ represent the probability of capturing target $j$, then for each subject, $i$:

Randomly select $s_{ij} \in \mathbb{S}_i$ from $S = \{0.0, 0.5, 1, 1.5, 2\}$ as the copy number at target $j$
Adjust the subject-specific capture probabilities by the copy number, $\mathbb{E}_i = \frac{\mathbb{E} \odot \mathbb{S}_{i}}{\sum_j \mathbb{E} \odot \mathbb{S}_{i}}$
Draw $N$ times from $\text{Multinomial}(\mathbb{E}_i)$, giving the molecule counts at each target $j$ for sample $i$, $c_{ij} \in \mathbb{C}_i$

We provide functionality within the mcCNV R package for producing reproducible simulations.

2.2.4 mcCNV algorithm

The mcCNV algorithm was adapted from the sSEQ method for quantifying differential expression in RNA-seq experiments with small sample sizes.⁹⁹ Yu et al. provide detailed theoretical background of the negative binomial model and using shrinkage to improve dispersion estimates. The mcCNV algorithm adjusts the sSEQ probability model by adding a multiplier for the copy state: \[ C_{ij} \sim \mathcal{NB}(f_is_{ij}\hat\mu_j, \tilde\phi_j/f_i) \] where the random variable $C_{ij}$ represents observed molecule counts for subject $i$ at target $j$, $f_i$ is the size factor for subject $i$, $s_{ij}$ is the copy state, $\mu_j$ is the expected mean under the diploid state at target $j$, and $\tilde\phi_j$ is the shrunken phi at target $j$. We observe $c_{ij}$ and wish to estimate $s_{ij}$, $\hat{s}_{ij}$. Initialize by setting $\hat{s}_{ij} = 1$ for all $i,j$. Then,

Adjust the observed values for the estimated copy-state, \[ c_{ij}^{\prime} = \frac{c_{ij}}{\hat{s}_{ij}}. \]
Subset $c_{ij}^{\prime}$ such that $c_{ij}^{\prime} > 10, ~ \hat{s}_{ij} > 0$
Calculate the size-factor for each subject \[ f_i = \text{median}\left(\frac{c_{ij}^{\prime}}{g_j}\right), \] where $g_j$ is the geometric mean at each exon.
Use method of moments to calculate the expected dispersion \[ \hat\phi_j = \max\left(0, \frac{\hat\sigma_j^2 - \hat{\mu}_j}{\hat{\mu}_j^2}\right) \] where $\hat{\mu}_j$ and $\hat{\sigma}_j^2$ are the sample mean and variance of $c_{ij}^{\prime}/f_i$.
Let $J$ represent the number of targets. Shrink the phi values to \[ \tilde\phi_j = (1 - \delta)\hat\phi_j + \delta\hat{\xi} \] such that \[ \delta = \frac{\sum\limits_j\left(\hat\phi_j - \frac{1}{n_j}\sum\limits_j \hat\phi_j\right)^2/(J - 1)} {\sum\limits_j\left(\hat\phi_j - \hat{\xi}\right)^2/(n_j - 2)} \] and \[ \hat{\xi} = \mathop{\text{argmin}}\limits_{\xi}\left\{ \frac{d}{d\xi}\frac{1}{\sum\limits_j \left(\hat\phi_j - \xi\right)^2} \right\}. \]
Update $\hat{s}_{ij}$, \[ \mathop{\text{argmax}}\limits_{s \in S}\left\{ \mathcal{L}(s \rvert c_{ij},f_i,\hat\mu_j,\tilde\phi_j) \right\} \] where $S = \{0.001, 0.5, 1, 1.5, 2\}$.
Repeat until the number of changed states falls below a threshold or a maximum number of iterations is reached.
After convergence, calculate p-values for the diploid state, $\pi_{ij} = \text{Pr}(s_{ij} = 1)$.
Adjust p-values using the Benjamini–Hochberg procedure¹⁰⁰ and filter to a final call-set such that adjusted p-values fall below some threshold, $\alpha$.

2.3 Results

2.3.1 Multiplexed capture reduces inter-sample variance

ES requires using molecular baits to “capture” the exonic DNA fragments during the library preparation (prior to sequencing). Most laboratories capture each sample individually. The capture efficiency varies with timing, temperature, and substrate concentrations, making identical capture reproduction impossible. Alternatively, one could multiplex (pool) samples prior to capture, capturing the pool of samples simultaneously. Here we profile the inter-sample variance of individual capture versus multiplexed capture.

A multinomial process provides a logical framework for modeling targeted capture, each target represented by an individual outcome. We can estimate the multinomial probability simplex for an exome capture by dividing the observed counts at each exon by the total mapped reads for the exome. The dirichlet distribution, conjugate prior to the multinomial, defines distributions of probability simplexes. The dirichlet distribution is parameterized by $\boldsymbol{\alpha} = \{\alpha_1, \alpha_2, \dots, \alpha_n\}$, where the expected probability for outcome $i$ is given by $\alpha_i/\alpha_0,~\alpha_0 = \sum \boldsymbol\alpha$. If $\boldsymbol\pi$ is a probability simplex drawn from a dirichlet with $\boldsymbol\alpha$, then the variance of $\boldsymbol\pi$ is inversely proportional to $\alpha_0$. Therefore, we can approximate the inter-sample variance by fitting the dirichlet distribution to each pool and interrogating the mean $\alpha$.

Table 2.1: Summary of whole-exome sequencing. “pool” indicates the name of the pool of samples; “capture” indicates the capture platform for the pool; “N” gives the number of samples in the pool; “medExon” gives the pool median of the subject median mapped molecule count per exon; “medTotal” gives the median by pool of total mapped molecule counts per subject; “minTotal” and “maxTotal” give the minimum and maximum total mapped molecules; “rsdTotal” gives the relative standard deviation (SD/mean*100) of total mapped molecules. $^\dagger$ indicates captures were performed independently on each sample within the pool, otherwise captures were multiplexed across all samples within the pool.
pool	capture	N	medExon	medTotal	minTotal	maxTotal	rsdTotal
IDT-IC$^\dagger$	IDT	16	143	55,149,058	37,453,015	85,138,915	22.4
IDT-MC	IDT	16	93	29,772,684	16,674,468	118,147,912	64.2
IDT-RR	IDT	16	272	79,079,629	61,289,322	120,147,888	22.9
NCGENES$^\dagger$	Agilent	112	93	24,451,245	12,749,793	68,565,471	27.6
Pool1	Agilent	16	56	13,265,614	8,911,132	17,324,903	18.5
Pool2	Agilent	16	86	21,076,056	4,585,195	27,846,146	27.6
SMA1	Agilent	8	56	12,256,002	11,051,840	13,600,697	6.2
SMA2	Agilent	8	25	5,622,040	4,904,000	6,545,360	10.4
WGS	Agilent	16	196	46,406,224	36,496,097	65,200,410	16.4

Using multiplexed capture, we sequenced 3 16-sample pools and 2 8-sample pools with Agilent baits and 2 16-sample pools with IDT baits (Table 2.1). To compare to individually-captured Agilent data, we randomly-selected 5 16-sample pools from the NCGENES cohort. For numeric stability, we subset to exons with at least 5 and no greater than 2000 counts across all samples within a pool. We then used a Newton-Raphson algorithm¹⁰¹ to fit the dirichlet distribution to each pool; all pools converged to stable estimates. We found, with one exception, multiplexed capture pools had greater $\alpha_0$ of their independently-captured counterparts (Figure 2.1).

$Multiplexed capture (MC) decreases variance with respect to independent captures (IC), as estimated by fitting the dirichlet distribution. Total counts/sample given on the horizontal axis; mean $\alpha$ given on the vertical axis. $\alpha_0$ is inversely proportional to inter-sample variance. Each line/point represents a single pool. The point indicates the median total counts across the pool, with the range given by the line. Orange indicates a multiplexed capture; blue indicates independent captures. Triangles indicate pools using Agilent (AGL) capture; squares indicate Integrated DNA Technologies (IDT).$

Figure 2.1: Multiplexed capture (MC) decreases variance with respect to independent captures (IC), as estimated by fitting the dirichlet distribution. Total counts/sample given on the horizontal axis; mean $\alpha$ given on the vertical axis. $\alpha_0$ is inversely proportional to inter-sample variance. Each line/point represents a single pool. The point indicates the median total counts across the pool, with the range given by the line. Orange indicates a multiplexed capture; blue indicates independent captures. Triangles indicate pools using Agilent (AGL) capture; squares indicate Integrated DNA Technologies (IDT).

The multiplexed pool without decreased inter-sample variance, IDT-MC, had a much larger spread in sequencing depth across the pool (Table 2.1, Figure 2.1). Looking at the total mapped molecules, the IDT-MC pool had a relative standard deviation of 64.2%, over double the next highest pool. We hypothesized the absent reduction in variation stemmed from poor library balance during the multiplexing step. We subsequently captured a new pool using the same DNA input, IDT-RR, and found comparable reductions in inter-sample variance (the pool with the highest $\alpha_0$ in Figure 2.1).

Examining the mean-variance relationship demonstrated the same inter-sample variance reduction suggested by the dirichlet parameter estimates (Figures 2.2 and 2.3). The Agilent pools (Figure 2.2) segregated cleanly, with less dispersion in the multiplexed capture pools. Again, we found no variance reduction for the IDT-MC pool, overlapping with independently-captured IDT-IC pool (Figure 2.2). We did, however, observe near-complete reduction in dispersion for the better-balanced IDT-RR pool.

Mean-variance relationship for Agilent (AGL) pools. Mean counts per exon given on the horizontal axis; mean variance per exon given on the vertical axis. Contours show the distribution of points by pool. Dotted lines show the ordinary least squares regression fit. Orange indicates multiplexed capture pools; blue indicates independently captured pools. The dashed gray line represents the 1:1 relationship expected under a Poisson process. Lines above the plot show the density of mean values by pool; lines to the right of the plot show the density of variance values by pool.

Figure 2.2: Mean-variance relationship for Agilent (AGL) pools. Mean counts per exon given on the horizontal axis; mean variance per exon given on the vertical axis. Contours show the distribution of points by pool. Dotted lines show the ordinary least squares regression fit. Orange indicates multiplexed capture pools; blue indicates independently captured pools. The dashed gray line represents the 1:1 relationship expected under a Poisson process. Lines above the plot show the density of mean values by pool; lines to the right of the plot show the density of variance values by pool.

Mean-variance relationship for Integrated DNA Technologies (IDT) pools. Mean counts per exon given on the horizontal axis; mean variance per exon given on the vertical axis. Contours show the distribution of points by pool. Dotted lines show the ordinary least squares regression fit. Orange indicates multiplexed capture pools; blue indicates independently captured pools. The dashed gray line represents the 1:1 relationship expected under a Poisson process. Lines above the plot show the density of mean values by pool; lines to the right of the plot show the density of variance values by pool.

Figure 2.3: Mean-variance relationship for Integrated DNA Technologies (IDT) pools. Mean counts per exon given on the horizontal axis; mean variance per exon given on the vertical axis. Contours show the distribution of points by pool. Dotted lines show the ordinary least squares regression fit. Orange indicates multiplexed capture pools; blue indicates independently captured pools. The dashed gray line represents the 1:1 relationship expected under a Poisson process. Lines above the plot show the density of mean values by pool; lines to the right of the plot show the density of variance values by pool.

Comparison of mean-variance relationship between WGS pool (blue) and IDT-RR pool (orange). Mean count by exon given on horizontal axis; variance of exon counts given on horizontal axis. Dotted lines show the ordinary least-squares fit. Lines above plot show the distribution of mean values; lines to the right of the plot show the distribution of variance values.

Figure 2.4: Comparison of mean-variance relationship between WGS pool (blue) and IDT-RR pool (orange). Mean count by exon given on horizontal axis; variance of exon counts given on horizontal axis. Dotted lines show the ordinary least-squares fit. Lines above plot show the distribution of mean values; lines to the right of the plot show the distribution of variance values.

2.3.2 Multiplexed capture provides controls for ExomeDepth

ExomeDepth requires a set of control subjects, summed into a reference vector of counts at each exon. ExomeDepth provides functionality to select appropriate controls from a set of subjects, often requiring large numbers of subjects to identify appropriate controls. Smaller research groups and clinical laboratories may struggle building large databases of exomes, with the difficulty compounded by lot-to-lot variation and regular improvements to capture and sequencing chemistries. We wanted to know if the reduced inter-sample variance with multiplexed capture could provide an appropriate control set for ExomeDepth, eliminating the need for large databases of similarly-captured exomes. We found the reduced inter-sample variance with multiplexed capture leads to appropriate control selection for ExomeDepth (Figures 2.5). Pool2, where we repeated the initial fragmentation 5 times, did not perform as well as the other multiplexed pools. We also found two samples within the WGS pool did not correlate well with the rest of the pool.

Median count per exon. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Figure 2.5: Median count per exon. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Total number of controls selected by ExomeDepth. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Figure 2.6: Total number of controls selected by ExomeDepth. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Proportion of available samples selected by ExomeDepth as a control. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Figure 2.7: Proportion of available samples selected by ExomeDepth as a control. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Estimated phi parameter from ExomeDepth. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

Figure 2.8: Estimated phi parameter from ExomeDepth. Each point represents a single sample, with samples grouped by pool. Triangles indicate independently-captured samples; circles indiciate a single multiplexed capture within the pool. Dotted vertical line separates the two capture platforms.

When we looked at independently-captured subjects, we found appropriate control sets for most of the 112 NCGENES subjects (Figure 2.6). However, ExomeDepth only selected 12.2% of available samples as controls, on average (Figure 2.7). Similarly, with the independently-captured IDT-IC pool we find low control numbers for most samples. While possible to select the same number of controls but exhibit differing dispersion, we observed little difference in the dispersion between independent and multiplexed capture (Figure 2.8). Overall, multiplexed capture provided appropriate controls for most samples tested, however an adequately-large set of available controls delivered comparable performance.

2.3.3 mcCNV & ExomeDepth perform comparably in simulation study

To compare our mcCNV algorithm and ExomeDepth, we created synthetic pools of data across different sequencing depths. Based on our observations with the real data, we selected the total number of molecules for each sample from a uniform distribution defined as a 30% window on either side of the specified depth; for example, for a specified depth of 10 million molecules, we drew the molecules per sample from 7 to 13 million molecules. For each depth ranging from 5 to 100 million molecules, we simulated 200 16-sample pools with single-exon variants. We allowed for homozygous and heterozygous deletions and duplications (0 to 4 copies), such that all variants were equally likely and the total variant probability was $1/1,000$. We used, as the starting capture probabilities ($\mathbb{E}$), the empiric capture probabilities observed by summing across the Pool1 pool.

We analyzed each of the 4,000 pools (200 replicates by 20 depths) using our algorithm and two iterations of ExomeDepth. For the first iteration of ExomeDepth, we used the default values for transition probability ($1/10,000$) and expected variant length (50 kb). For the second iteration, we used the true simulated variant prior for the transition probability ($1/1,000$) and an expected variant length of 1 kb. As expected, the sensitivity increased and false discovery rate decreased as the sequencing depth increased (Figures 2.9). In both comparisons, mcCNV demonstrated a superior false-discovery rate. When interrogating Matthew’s correlation coefficient¹⁰² and the sensitivity, we found mcCNV had marginal performance over ExomeDepth with default parameters and marginal performance under ExomeDepth with simulation-matched parameters. Table 2.2 provides the actual values.

$Algorithm performance comparing mcCNV and ExomeDepth on simulated exomes. (A-C) mcCNV versus ExomeDepth with default parameters, $1/10,000$ transition probability and 50 kb expected variant length. (D-F) mcCNV versus ExomeDepth with simulation-matched parameters, $1/1,000$ transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthew’s correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship.$ $Algorithm performance comparing mcCNV and ExomeDepth on simulated exomes. (A-C) mcCNV versus ExomeDepth with default parameters, $1/10,000$ transition probability and 50 kb expected variant length. (D-F) mcCNV versus ExomeDepth with simulation-matched parameters, $1/1,000$ transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthew’s correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship.$ $Algorithm performance comparing mcCNV and ExomeDepth on simulated exomes. (A-C) mcCNV versus ExomeDepth with default parameters, $1/10,000$ transition probability and 50 kb expected variant length. (D-F) mcCNV versus ExomeDepth with simulation-matched parameters, $1/1,000$ transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthew’s correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship.$ $Algorithm performance comparing mcCNV and ExomeDepth on simulated exomes. (A-C) mcCNV versus ExomeDepth with default parameters, $1/10,000$ transition probability and 50 kb expected variant length. (D-F) mcCNV versus ExomeDepth with simulation-matched parameters, $1/1,000$ transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthew’s correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship.$ $Algorithm performance comparing mcCNV and ExomeDepth on simulated exomes. (A-C) mcCNV versus ExomeDepth with default parameters, $1/10,000$ transition probability and 50 kb expected variant length. (D-F) mcCNV versus ExomeDepth with simulation-matched parameters, $1/1,000$ transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthew’s correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship.$ $Algorithm performance comparing mcCNV and ExomeDepth on simulated exomes. (A-C) mcCNV versus ExomeDepth with default parameters, $1/10,000$ transition probability and 50 kb expected variant length. (D-F) mcCNV versus ExomeDepth with simulation-matched parameters, $1/1,000$ transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthew’s correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship.$

Figure 2.9: Algorithm performance comparing mcCNV and ExomeDepth on simulated exomes. (A-C) mcCNV versus ExomeDepth with default parameters, $1/10,000$ transition probability and 50 kb expected variant length. (D-F) mcCNV versus ExomeDepth with simulation-matched parameters, $1/1,000$ transition probability and 1 kb expected variant length. Numbered points indicate the simulated depth in millions of molecules. ‘MCC’ indicates Matthew’s correlation coefficient; ‘TPR’ indicates true positive rate/sensitivity; ‘FDR’ indicates false discovery rate. Dashed black line shows the 1:1 relationship.

Table 2.2: Simulation results by algorithm. ED-def: ExomeDepth with default parameters; ED-sim: ExomeDepth with simulation-matched parameters. Values represent the mean over 200 simulations.
	MCC			TPR			FDR
dep	mcCNV	ED-def	ED-sim	mcCNV	ED-def	ED-sim	mcCNV	ED-def	ED-sim
5	0.713	0.401	0.519	0.522	0.192	0.298	0.02230	0.15900	0.09260
10	0.694	0.628	0.708	0.503	0.431	0.549	0.04250	0.08450	0.08590
15	0.781	0.742	0.801	0.627	0.581	0.690	0.02600	0.05270	0.06940
20	0.840	0.811	0.857	0.719	0.682	0.777	0.01810	0.03420	0.05360
25	0.879	0.856	0.893	0.783	0.752	0.832	0.01310	0.02460	0.04090
30	0.907	0.889	0.918	0.831	0.804	0.872	0.00967	0.01750	0.03210
35	0.926	0.909	0.935	0.864	0.839	0.897	0.00807	0.01370	0.02600
40	0.941	0.927	0.948	0.891	0.869	0.917	0.00638	0.01060	0.02080
45	0.952	0.940	0.957	0.911	0.892	0.932	0.00527	0.00846	0.01680
50	0.961	0.950	0.965	0.927	0.910	0.944	0.00437	0.00701	0.01370
55	0.966	0.957	0.969	0.937	0.921	0.951	0.00377	0.00569	0.01180
60	0.972	0.963	0.974	0.947	0.933	0.959	0.00318	0.00517	0.00986
65	0.976	0.969	0.978	0.955	0.943	0.964	0.00290	0.00433	0.00837
70	0.978	0.972	0.980	0.960	0.949	0.968	0.00252	0.00381	0.00735
75	0.981	0.976	0.983	0.965	0.955	0.972	0.00212	0.00321	0.00625
80	0.983	0.978	0.985	0.969	0.960	0.975	0.00200	0.00294	0.00560
85	0.985	0.980	0.986	0.972	0.963	0.977	0.00181	0.00263	0.00491
90	0.987	0.982	0.987	0.975	0.967	0.979	0.00169	0.00243	0.00451
95	0.988	0.984	0.988	0.978	0.970	0.981	0.00156	0.00223	0.00393
100	0.989	0.985	0.989	0.980	0.973	0.982	0.00150	0.00195	0.00359

2.3.4 mcCNV & ExomeDepth perform comparably on WGS pool

To establish a truth set on real data, we performed matched genome sequencing on the subjects included in the WGS pool. Following the best practices suggested by Trost et al.,⁹⁸ we performed read-depth based CNV calling using the genome data. In line with recommendations by Trost et al., we excluded from comparative analysis any exons overlapping repetitive or low-complexity regions (34,856 out of 179,250). We then compared the exome calls using mcCNV and ExomeDepth to the genome calls using the overlap of ERDS and cnvpytor. Table 2.3 lists the total calls by subject. Overall, mcCNV predicted the largest number of variants; however, 85.7% of predicted variants were deletions from two samples (NCG_00790 and NCG_00851). ExomeDepth also predicted a disproportionate number of deletions for NCG_00790 and NCG_00851, totaling 69.4% of calls.

Table 2.3: Number of CNV calls by subject and algorithm for the ‘WGS’ pool. ‘MC’ indicates the mcCNV algorithm; ‘ED’ indicates the ExomeDepth algorithm; ‘WG’ indicates the overlap of ERDS/cnvpytor calls from matched whole-genome sequencing. Exons with any overlap of the repetitive and low-complexity regions, as defined in the Trost et al. manuscript,⁹⁸ omitted from analysis.
	Total			Duplications			Deletions
subject	MC	ED	WG	MC	ED	WG	MC	ED	WG
NCG_00012	90	106	143	61	73	121	29	33	22
NCG_00237	82	101	165	50	64	129	32	37	36
NCG_00525	68	74	151	30	33	110	38	41	41
NCG_00593	45	58	142	22	28	81	23	30	61
NCG_00676	66	78	112	38	46	92	28	32	20
NCG_00790	5,156	2,204	121	19	37	92	5,137	2,167	29
NCG_00819	68	76	134	30	41	100	38	35	34
NCG_00840	78	92	157	44	52	115	34	40	42
NCG_00851	1,151	859	141	28	51	102	1,123	808	39
NCG_00857	59	75	119	10	15	81	49	60	38
NCG_00976	46	58	114	25	37	93	21	21	21
NCG_01023	59	95	143	32	60	113	27	35	30
NCG_01043	73	94	128	40	64	105	33	30	23
NCG_01076	36	57	105	7	22	78	29	35	27
NCG_01077	135	157	230	103	121	184	32	36	46
NCG_01117	95	101	154	72	78	129	23	23	25

Looking at the control selection, for NCG_00790 and NCG_00851 ExomeDepth only selected 2 and 3 controls, respectively. Furthermore, NCG_00790 and NCG_00851 had substantially higher dispersion than the rest of the pool (two outliers in Figure 2.8).

Recognizing the genome calls do not represent an accurate truth set, we looked at the ability of mcCNV and ExomeDepth to predict the genome calls. Due to the large number of deletions called for NCG_00790 and NCG_00851, both algorithms performed poorly in predicting the genome calls (Table 2.4). When we excluded NCG_00790 and NCG_00851 from the analysis, mcCNV had comparable, but uniformly better performance. Both algorithms demonstrated greater power to detect deletions. Figures 2.10 and 2.11 show the call overlap between the three approaches, including and excluding NCG_00790 and NCG_00851, respectively.

Table 2.4: mcCNV (MC)/ExomeDepth (ED) calls for ‘WGS’ pool (used as prediction) versus the ERDS/cnvpytor calls from matched genome sequencing (used as truth). Calls are subdivided by duplications (DUP) and deletions (DEL). ‘Full’ gives performance across the full pool; ‘Sub’ gives the performance excluding the poorly correlated samples NCG_00790 and NCG_00851 (gray rows). ‘MCC’ is Matthew’s correlation coefficient, ‘TPR’ is true positive rate/sensitivity, ‘FDR’ is false discovery rate, ‘PPV’ is positive predictive value, ‘BalAcc’ is balanced accuracy. Exons with any overlap of the repetitive and low-complexity regions, as defined in the Trost et al. manuscript,⁹⁸ omitted from analysis.
			MCC	TPR	FDR	PPV
ALL	Total	MC	0.185	0.335	0.897	0.1030
	Total	ED	0.263	0.363	0.809	0.1910
	Sub	MC	0.487	0.345	0.311	0.6890
	Sub	ED	0.482	0.378	0.383	0.6170
DUP	Total	MC	0.396	0.236	0.334	0.6660
	Total	ED	0.347	0.240	0.496	0.5040
	Sub	MC	0.404	0.246	0.333	0.6670
	Sub	ED	0.384	0.266	0.446	0.5540
DEL	Total	MC	0.180	0.639	0.949	0.0509
	Total	ED	0.219	0.558	0.914	0.0861
	Sub	MC	0.683	0.661	0.294	0.7060
	Sub	ED	0.541	0.554	0.471	0.5290

Copy number variant call concordance for the WGS pool. (A) predicted duplications; (B) predicted deletions. mcCNV in grey; ExomeDepth in blue; ERDS/cnvpytor in orange. Values within overlaps give the number of variants.

Figure 2.10: Copy number variant call concordance for the WGS pool. (A) predicted duplications; (B) predicted deletions. mcCNV in grey; ExomeDepth in blue; ERDS/cnvpytor in orange. Values within overlaps give the number of variants.

Copy number variant call concordance for the WGS pool, excluding subjects NCG_00790 and NCG_00851 due to poor correlation to the rest of the pool. (A) predicted duplications; (B) predicted deletions. mcCNV in grey; ExomeDepth in blue; ERDS/cnvpytor in orange. Values within overlaps give the number of variants.

Figure 2.11: Copy number variant call concordance for the WGS pool, excluding subjects NCG_00790 and NCG_00851 due to poor correlation to the rest of the pool. (A) predicted duplications; (B) predicted deletions. mcCNV in grey; ExomeDepth in blue; ERDS/cnvpytor in orange. Values within overlaps give the number of variants.

2.4 Discussion

The medical genetics community still lacks robust exome-wide information about small (exon-level) variant prevalence. Others have established the reliability and cost-efficiency of pre-capture multiplexing,^103–107 and most commercial exome capture platforms have protocols for pre-capture multiplexing. Here, we demonstrate the reduction in inter-sample variance when pre-capture multiplexing, leading to increased power to detect exon-level copy number variation. Despite the benefits, many clinical laboratories do not employ a multiplexed capture protocol because multiplexing requires waiting to fill a pool and may delay results. While we understand the increased complexity, multiplexed capture may uncover otherwise missed copy number variation and increase the diagnostic yield for patients.

Multiplexed capture is not without limitations. We presented an example (pool IDT-MC) where multiplexed capture provided little to no improvement over independently-captured samples. We concluded the absent improvement in inter-sample variance stemmed from the poor library balance prior to capture. Rebuilding a more-balanced pool with the same samples (pool IDT-RR) demonstrated a large reduction in inter-sample variance.

In assessing the inter-sample variance, we compared two capture platforms: (1) Agilent SureSelectXT2 and (2) Integrated DNA Technologies xGen Lockdown Probes. We do not have enough data to suggest definitively one over the other. Comparing the mean-variance relationship, the IDT-RR pool appeared to have less dispersion overall (Figure 2.4); however, the sample-specific dispersion estimates from ExomeDepth suggest better performance by the WGS pool (Figure 2.8) and the higher pool-wide dispersion comes entirely from the two poorly correlated samples.

Our results suggest having a sufficiently large database of samples most-often provides appropriate control samples to estimate copy number variation (Figure 2.5). However, we show laboratories can circumvent the need for large samples by multiplexing the capture step. Defining the capture pool as the set of controls both limits the need for regular reanalysis as the database grows and eliminates potential over-selecting of samples with the same variants.

At the depth of the WGS pool, our simplistic simulation study would suggest both mcCNV and ExomeDepth have the power to detect single-exon variants with >85% sensitivity while maintaining a low false-discovery rate (Figure 2.9, Table 2.2). However, comparing the exome calls to the genome calls for the WGS pool revealed lackluster concordance. As Trost et al. point out, the genome CNV callers still struggle with variants less than 1 kb.⁹⁸ We do not dismiss the possibility of exome calls providing greater reliability than the genome calls, given multiplexed capture and adequate sequencing depth. However, given the distribution of calls throughout the exome, we doubt the thousands of excess deletions called for NCG_00790 and NCG_00851. Confirmation of the individual calls is beyond the scope of this work.

Unsurprisingly, both mcCNV and ExomeDepth failed to call many of the duplications called from the genome data. The variance for the negative binomial increases as the mean increases; we expect greater variation in read depth from duplicated loci, making duplications more difficult to distinguish. Similarly, the variance of the binomial proportion increases monotonically over [0, 0.5). More sensitive detection of duplications will likely require greater sequencing depth.

The simulation study emphasizes the importance of sequencing depth (in terms of absolute molecules). We can collect increased basepair coverage for less money by sequencing longer reads (e.g. 2x150 versus 2x50), but doing so decreases power for depth-based CNV calling. Typically, exome sequencing targets 30-50x coverage to ensure most targets have sufficient coverage for accurate basepair calling. We demonstrate the need for much deeper sequencing if we wish to establish exon-level variants.

Taken together, we recommend the following: (1) research and clinical endeavors consider adjusting protocols to multiplex samples prior to any targeted capture; (2) prior to capture, we suggest checking the library balance and adjusting as necessary (we achieved reasonable performance with relative standard deviation values less than 25%); (3) collecting an average of 225 filtered read-pairs per target. We then provide a simple-to-use and efficient R package to estimate copy number utilizing the negative bionimal distribution.

We believe the uncertainty about the prevalence and clinical significance of exon-level variants warrants a large undertaking. Even if we take the conservative approach and looking only at concordant calls between genome and exome sequencing (Figure 2.11), we have an average of 40 variants per sample to contend with. Two possibilities exist: (1) the algorithms all fail over specific regions, or (2) some genes can tolerate intrageneic copy-number variation better than others. Having eliminated calls from repetitive and low-complexity regions, we believe possibility (2) is more likely. To truly determine the prevalence (and therefore, clinical significance) of exon-level variants we need to interrogate exon-level variants on a large cohort. Confirmation testing for the tens to thousands of predicted variants from the exome and genome calls would allow true determination of algorithm performance and inform the clinical utility.

pool	capture	N	medExon	medTotal	minTotal	maxTotal	rsdTotal
IDT-IC\(^\dagger\)	IDT	16	143	55,149,058	37,453,015	85,138,915	22.4
IDT-MC	IDT	16	93	29,772,684	16,674,468	118,147,912	64.2
IDT-RR	IDT	16	272	79,079,629	61,289,322	120,147,888	22.9
NCGENES\(^\dagger\)	Agilent	112	93	24,451,245	12,749,793	68,565,471	27.6
Pool1	Agilent	16	56	13,265,614	8,911,132	17,324,903	18.5
Pool2	Agilent	16	86	21,076,056	4,585,195	27,846,146	27.6
SMA1	Agilent	8	56	12,256,002	11,051,840	13,600,697	6.2
SMA2	Agilent	8	25	5,622,040	4,904,000	6,545,360	10.4
WGS	Agilent	16	196	46,406,224	36,496,097	65,200,410	16.4