Chapter 1 Introduction

1.1 Outline

This chapter outlines the historical context for human and medical genetics, with a basic overview of the topics necessary to understand the original work presented in this dissertation.

Human Genetics Primer summarizes DNA discovery as the hereditary material in eukaryotic life (1.2.1) and describes the types of variation observed in DNA (1.2.2).

DNA Sequencing describes the development of DNA sequencing, with an overview of first generation sequencing (1.3.1), the discoveries that led to second generation sequencing (1.3.2), and a more detailed discussion of the Solexa/Illumina sequencing platform (1.3.3). The section concludes with a longer discussion of the considerations and challenges with analyzing second-generation sequencing data (1.3.4).

Development of Medical Genetics highlights some of the most important landmarks in the origination of medical genetics as both a research topic and a clinical profession.

Finally, Rationale briefly describes the motivation for my dissertation work.

1.2 Human Genetics Primer

1.2.1 Discovery of DNA and the central dogma

The discovery of deoxyribonucleic acid (DNA) took roughly 100 years of work and has fundamentally changed how we view ourselves, society, and life. The study of genetics began with Gregor Mendel’s pea experiments and his discovery of inheritance in the middle of the 19th century.1 Shortly after Mendel’s work, Friedrich Miescher isolated “nuclein” from lymphocytes, noting its uniquely high proportion of phosphorus in the form of phosphoric acid.2–4 Albrecht Kossel and Albert Neumann furthered Miescher’s work by identifying the four bases and renaming “nuclein” deoxyribonucleic acid.5 Walther Flemming first described mitosis (the division of cells), showing the doubling and separation of chromosomes.6 Theodor Boveri and Walter Sutton independently discovered meiosis, establishing chromosomes as the vehicle for inheritance (i.e., the “chromosome theory of inheritance”).7–9

Despite early suggestions by Kossel and Neumann that chromatin contains DNA, many believed that proteins, not DNA, encoded the fundamental information for inheritance. Oswald Avery, Colin MacLeod, and Maclyn McCarty published the first experiments establishing that DNA carries the genetic code, using Diplococcus pneumoniae.10 Erwin Chargaff, convinced by Avery et al., discovered the equal proportions of adenine/thymine and guanine/cytosine (“Chargaff’s rule”), disproving the tetranucleotide hypothesis and laying the groundwork for the double helical model.11 In the early 1950s Rosalind Franklin began using X-ray crystallography to study the structure of DNA, producing the first images showing its double helical form.12 Watson and Crick were given Franklin’s images without her knowledge or permission, allowing them to perform the final work establishing the structure of DNA.13 Crick went on to formulate the Central Dogma of Molecular Biology.14,15

The Central Dogma of Molecular Biology describes how DNA encodes the proteins that build and sustain eukaryotic life. To produce proteins, RNA polymerase first transcribes the DNA message into single-stranded ribonucleic acid (RNA) molecules (messenger RNA, mRNA). After post-transcriptional modifications, including splicing (removal of introns and joining of exons), the mRNA is translated into a polymer of amino acids by the ribosome. Amino acid polymers, also known as polypeptides or peptide chains, form the primary structure of proteins. Modifications to DNA can therefore have profound impacts on cellular and organismal function.

1.2.2 Genetic variation in humans

Humans are diploid organisms, meaning we carry two copies of each chromosome. Under normal circumstances, we receive one set of chromosomes from each biological parent. Humans have 46 chromosomes (23 from each parent): 22 pairs of autosomes (chromosomes 1–22) and two sex chromosomes (X and Y).16 The haploid (single-copy) genome spans roughly 3.1 billion basepairs, of which roughly 1.5% is predicted to code for protein.17 Broadly, four classes of variants occur within DNA: (1) single nucleotide substitutions (single nucleotide variants, SNVs), (2) insertions and deletions (indels), (3) copy number variants, and (4) translocations and inversions.

SNVs occur when one base replaces another at a specific position in the sequence. DNA codes for amino acids (the building blocks of protein) using three consecutive bases (a codon). The genetic code includes redundancy, particularly in the third codon position (e.g., the sequences GAA and GAG both code for the amino acid glutamic acid). Consequently, mutations may be synonymous or non-synonymous. Synonymous mutations, most often in the third codon position, do not change the resulting amino acid. Non-synonymous mutations change the resulting protein, whether by substituting one amino acid for another (missense mutations), introducing a premature stop codon (nonsense mutations), eliminating the start codon (non-start mutations), or eliminating the stop codon (nonstop mutations). SNVs can arise through polymerase errors during DNA replication or through mutagens (e.g., specific wavelengths of light, chemical exposure).
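To make the distinction concrete, the following minimal sketch (illustrative only; the codon table is deliberately partial) classifies a single-base substitution as synonymous, missense, or nonsense:

```python
# Minimal sketch: classify a coding substitution using a (partial) codon table.
# The table below is illustrative and incomplete; a real implementation would
# use the full standard genetic code.
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",   # redundancy in the third position
    "GAC": "Asp", "GAT": "Asp",
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
}

def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Return 'synonymous', 'missense', or 'nonsense' for a single-base change."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "synonymous"
    if after == "Stop":
        return "nonsense"
    return "missense"

print(classify_substitution("GAA", 2, "G"))  # GAA -> GAG, both Glu: synonymous
print(classify_substitution("GAA", 2, "C"))  # GAA -> GAC, Glu -> Asp: missense
print(classify_substitution("GAA", 0, "T"))  # GAA -> TAA, premature stop: nonsense
```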

The remaining three types of mutations fall into the larger category of structural variation (SV), with blurry lines between what constitutes an insertion/deletion (indel) versus a copy number variant (CNV). Indels represent small insertions or deletions of genetic material; any indel whose length is not divisible by three shifts the reading frame (a frameshift mutation) and alters all downstream amino acids, as illustrated below. Copy number variants represent larger insertions or deletions and can range from a single exon up to whole chromosomes (aneuploidy). Aneuploidy arises from chromosome segregation errors during cell division. Smaller copy number variation likely occurs through non-allelic homologous recombination (NAHR), non-homologous end-joining (NHEJ), fork stalling and template switching (FoSTeS), and retrotransposition.18
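A small sketch of the frameshift arithmetic (toy sequence, illustrative only): deleting one base re-partitions every downstream codon, while deleting three bases removes a codon but preserves the frame.

```python
def codons(seq: str) -> list[str]:
    """Split a coding sequence into consecutive 3-base codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

coding = "ATGGAAGTTCCTTGA"               # toy coding sequence, not a real gene
print(codons(coding))                     # ['ATG', 'GAA', 'GTT', 'CCT', 'TGA']

# Deleting 1 base (not divisible by 3) shifts every downstream codon.
print(codons(coding[:4] + coding[5:]))    # ['ATG', 'GAG', 'TTC', 'CTT'] -- frameshift

# Deleting 3 bases (divisible by 3) removes one codon but keeps the frame.
print(codons(coding[:3] + coding[6:]))    # ['ATG', 'GTT', 'CCT', 'TGA']
```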

We call a segmental rearrangement a translocation when the segment moves from one locus to another, and an inversion when the segment is flipped and reinserted at the same locus. Rearrangements can have detrimental effects when they disrupt gene sequences. The same processes that create CNVs, discussed above, can create rearrangements, and rearrangements often go hand-in-hand with copy number variation.

1.3 DNA Sequencing

1.3.1 First-generation sequencing

Building on lessons learned from earlier RNA sequencing efforts, the first DNA sequencing techniques arose in the 1970s with Sanger’s original plus-minus approach,19 the Maxam-Gilbert chemical cleavage approach,20 and Sanger’s chain termination approach.21

Maxam-Gilbert sequencing works by cleaving DNA at specific bases using base-specific chemical reactions. Before cleavage, radioactive phosphorus is incorporated at the DNA fragment’s 5’ terminus. The fragment is then partially cleaved at specific bases in four separate reactions: G, G and A, C, or C and T. The cleaved radiolabeled fragments from each of the four reactions are then size-separated and visualized on a polyacrylamide gel.

Sanger sequencing (chain termination) was the first sequencing-by-synthesis (SBS) approach. As with Maxam-Gilbert sequencing, the target DNA fragment is copied in four separate reactions, here by polymerase extension from a sequencing primer. Each reaction contains an equimolar mix of the four deoxynucleotide triphosphates (dNTPs, the DNA bases) and a small amount of a single radio- or fluorescently-labeled dideoxynucleotide (ddNTP). Synthesis cannot proceed after a ddNTP is incorporated, so each of the four reactions contains fragments terminating at every occurrence of the corresponding base. Again, the four reactions are size-separated and visualized on a polyacrylamide gel.

1.3.2 Second-generation sequencing

In the following decade Nyrén and Lundin discovered an enzymatic method for detecting the incorporation of a new base during sequencing.22 Pyrophosphate is released when a dNTP is incorporated into a DNA polymer; Nyrén added two enzymes to the synthesis reaction: (1) ATP sulfurylase, which converts pyrophosphate into ATP, and (2) luciferase, which converts ATP into light. After fixing the DNA template to a solid phase, sequencing is performed by adding one base at a time and watching for a light reaction. Pyrosequencing struggles with homopolymers (contiguous runs of the same base), with poor performance beyond 4–5 identical bases.23

The next significant breakthrough came in the early 2000s when Li et al. developed the first photocleavable fluorescent nucleotide.24 The novel nucleotides use a fluorescent tag to block the 3’ hydroxyl group, which can be cleaved using a specific wavelength of light. This allows for SBS with a “reversible termination” of synthesis after each base incorporation. The reversible terminators, in conjunction with the development of glass-bound colony expansion,25 laid the groundwork for the Solexa system (acquired by Illumina) which currently dominates the sequencing field.26

1.3.3 Illumina sequencing

Illumina sequencing works by creating clusters of identical DNA fragments bound to a glass plate (the “flow cell”), then performing SBS using fluorescent reversible terminators. During library preparation, specific sequencing adapters are ligated onto short DNA fragments to: (1) bind the fragments to the flow cell; (2) initiate amplification; and (3) optionally identify the fragment source. The flow cell contains a “lawn” of two short oligos bound to the glass surface, complementary to the forward and reverse adapters. The sequencing library, carrying the ligated forward and reverse adapters, is added to the flow cell, where the fragments hybridize to the lawn. Once a fragment is bound, polymerase extends the bound oligo using the hybridized fragment as a template. The original template is then washed away, leaving complementary sequences bound to the flow cell. The free adapter end then bends over to hybridize with its complementary lawn oligo, forming a bridge, and polymerase fills in the second strand to form a double-stranded fragment (bridge amplification). The double-stranded fragment is denatured, leaving two single-stranded fragments bound to the flow cell. Bridge amplification is repeated until each cluster contains hundreds of copies of the same fragment. The reverse-strand fragments are then cleaved from the flow cell, and the clusters are sequenced by detecting the incorporation of fluorescent reversible terminators. Each cluster is imaged as bases are incorporated, giving the final DNA sequence.

1.3.4 Processing short-read sequencing data

With these advancements in sequencing chemistry, we can now sequence vast amounts of DNA cheaply. However, the massively parallel sequencing modalities only sequence small fragments of DNA (typically 50 to 500 basepairs in length), often using a “shotgun” approach – “shotgun” referring to sequencing a randomly fragmented sample rather than a known locus. The nature of short-read shotgun sequencing therefore requires robust computational approaches to process and contextualize sequence data for millions of DNA fragments.

Here, I give an overview of processing sequencing data for a species with an established reference genome. Processing short-read sequencing data proceeds through the following general steps (a minimal end-to-end sketch follows the list):

  1. pre-processing to remove artificially added sequence (sequencing adapters, sample barcodes, etc.) and create FASTA/FASTQ27,28 output;

  2. map individual reads to their original location in the reference genome and create Sequence Alignment Map (SAM/BAM)29 output;

  3. optional post-mapping quality control;

  4. variant identification;

  5. variant filtering and interpretation.
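The sketch below outlines steps 1–4 using commonly-used tools (bcl2fastq, BWA, samtools, GATK). The file names are placeholders, the reference is assumed to be pre-indexed, and exact arguments vary by tool version, so this is an illustration rather than the exact pipeline used later in this dissertation.

```python
# Illustrative outline of a short-read processing pipeline. File names are
# placeholders; the reference (ref.fa) is assumed to already have BWA and
# samtools/GATK indexes built. Exact flags differ between tool versions.
import subprocess

def run(cmd: str) -> None:
    """Run a shell command, raising an error if it fails."""
    subprocess.run(cmd, shell=True, check=True)

# 1. Demultiplex raw base calls into per-sample FASTQ files.
run("bcl2fastq --runfolder-dir run_folder/ --output-dir fastq/")

# 2. Map reads to the reference genome and sort the alignments.
run("bwa mem ref.fa fastq/sample_R1.fastq.gz fastq/sample_R2.fastq.gz "
    "| samtools sort -o sample.bam -")
run("samtools index sample.bam")

# 3. Post-mapping quality control: mark likely artificial duplicates.
run("gatk MarkDuplicates -I sample.bam -O sample.dedup.bam -M dup_metrics.txt")

# 4. Identify SNVs and small indels.
run("gatk HaplotypeCaller -R ref.fa -I sample.dedup.bam -O sample.vcf.gz")
```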

The pre-processing step depends entirely on the sequencing chemistry and machinery used. Illumina sequencers produce binary base call (BCL) files containing all raw base calls and quality information from the sequencing run. BCL files contain the adapter sequence (including the sample barcode sequence and optional molecular index sequence), which we must remove before mapping. Because of modern sequencer capacity, each lane of the flow cell frequently contains multiple samples. By convention, reads from each sample are separated into individual FASTQ files; this demultiplexing must occur before discarding the adapter sequence information. Illumina currently provides the ‘bcl2fastq’ command line tool to perform all of the required tasks, producing sample-specific FASTQ files with molecular index information when applicable.
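Each FASTQ record spans four lines: a read identifier, the base calls, a separator, and a per-base quality string (Phred scores encoded as ASCII characters offset by 33). A minimal reader, for illustration only (production pipelines use established parsers), might look like:

```python
import gzip
from typing import Iterator, Tuple

def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read name, sequence, quality string) from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:                 # end of file
                return
            seq = handle.readline().rstrip()
            handle.readline()              # '+' separator line
            qual = handle.readline().rstrip()
            yield header.lstrip("@"), seq, qual

def phred(qual: str) -> list[int]:
    """Decode a quality string into integer Phred scores (ASCII offset 33)."""
    return [ord(c) - 33 for c in qual]
```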

Mapping individual reads (query sequences) to a reference sequence requires (1) finding the correct starting point in the reference sequence and (2) accounting for substitutions, insertions, and deletions in the query sequence. Smith and Waterman published the first algorithm meeting both requirements, using dynamic programming over a substitution matrix30 based on Needleman and Wunsch’s initial work.31 The Smith-Waterman algorithm requires user-defined scores for matches, mismatches, and gaps (insertions/deletions); the algorithm finds the best possible local alignment under the given scoring system, but requires \(O(mn)\) compute time, where \(m\) and \(n\) represent the lengths of the reference and query sequences.
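A minimal Smith-Waterman implementation (linear gap penalty, arbitrary example scores) illustrates the \(O(mn)\) dynamic program:

```python
def smith_waterman(ref: str, query: str, match=2, mismatch=-1, gap=-2) -> int:
    """Return the best local alignment score between ref and query.

    Fills an (m+1) x (n+1) dynamic-programming matrix, hence O(mn) time.
    Uses a simple linear gap penalty for clarity.
    """
    m, n = len(ref), len(query)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score = match if ref[i - 1] == query[j - 1] else mismatch
            H[i][j] = max(
                0,                        # local alignment: never go negative
                H[i - 1][j - 1] + score,  # match/mismatch (diagonal)
                H[i - 1][j] + gap,        # gap in the query
                H[i][j - 1] + gap,        # gap in the reference
            )
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # toy example pair
```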

To reduce the problem’s complexity, Altschul et al. developed the basic local alignment search tool (BLAST).32 BLAST works by breaking the query sequence into a hash table of all possible \(k\)-mer sub-sequences and searching the reference sequence for ungapped matches scoring above some threshold. BLAST then extends pairs of nearby matches to refine the candidate pool, and finalizes the best candidates using the Smith-Waterman algorithm. Many other tools take similar hash table approaches, including hashing the reference sequence rather than the query sequence.33
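The seeding idea reduces to a hash table lookup; the toy sketch below (arbitrary sequences, exact ungapped seeds only) finds candidate loci that an aligner would then extend and refine:

```python
from collections import defaultdict

def kmer_index(reference: str, k: int) -> dict:
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_matches(query: str, index: dict, k: int) -> list:
    """Exact k-mer seed hits as (query offset, reference offset) pairs."""
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((j, i))
    return hits

ref = "ACGTACGTTAGCCGATACGT"              # toy reference sequence
idx = kmer_index(ref, k=4)
print(seed_matches("CGATACG", idx, k=4))  # candidate loci to extend and refine
```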

Modern alignment algorithms have further improved efficiency by exploiting the Burrows-Wheeler Transform (BWT).33,34 The BWT creates a quickly searchable, compressed representation of the reference sequence (roughly 1 gigabyte for the complete human genome), enabling in-memory computation.35 The various BWT-based algorithms differ primarily in how they handle mismatches.33 For DNA sequencing, the Burrows-Wheeler Alignment tool (BWA), developed by Li and Durbin in 2009 and subsequently refined, remains the de facto industry standard.29,36
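The transform itself can be written naively for short strings (real aligners construct the BWT from a suffix array and pair it with an FM-index for fast exact-match queries):

```python
def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations.

    This quadratic construction is only for illustrating the transform on
    short strings; production aligners build the BWT from a suffix array and
    query it through an FM-index.
    """
    text = text + "$"                      # unique end-of-string sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

print(bwt("GATTACA"))  # a compact, reversible permutation of the input
```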

Post-alignment processing prepares the mapped reads for variant calling. Artificial duplicate reads can bias downstream variant calling and deserve careful consideration. Two types of artificial duplicates occur with Illumina sequencing: (1) PCR duplicates and (2) technical (optical and cluster) duplicates. In large, randomly-fragmented libraries sequenced to moderate depth, duplicate reads are much more likely to be artificial than true duplicates. Virtually all sequencing library preparation protocols include PCR amplification, producing artificial duplicate reads. On non-patterned flow cells, the image processing software may incorrectly identify a large or oddly shaped cluster as two separate clusters; on patterned flow cells, the same template can occasionally “jump” into an adjacent cluster.

In deeply-sequenced libraries with low complexity, we are more likely to observe true read duplicates. Without including unique molecular identifiers (UMIs) in the adapter sequence, we have no way of distinguishing true versus artificial duplicate reads. A UMI is a short (generally 6-12 basepairs) sequence of random bases; all PCR duplicates will contain the same UMI sequence. The exceedingly low probability of two true read duplicates having the same UMI allows properly controlling for artificial duplicates without removing true duplicates.
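A toy sketch of UMI-aware duplicate marking (a simplified stand-in for production tools: reads are grouped by mapping coordinate plus UMI, and only the best read per group is retained):

```python
from collections import defaultdict
from typing import NamedTuple

class Read(NamedTuple):
    name: str
    chrom: str
    pos: int       # 5' mapping coordinate
    umi: str
    mapq: int

def mark_duplicates(reads: list) -> set:
    """Return names of reads flagged as likely PCR duplicates.

    Reads sharing the same mapping coordinate AND the same UMI are treated as
    copies of one original molecule; the highest-MAPQ read is kept. Reads at
    the same coordinate with different UMIs are retained as true duplicates.
    """
    groups = defaultdict(list)
    for r in reads:
        groups[(r.chrom, r.pos, r.umi)].append(r)
    dupes = set()
    for group in groups.values():
        group.sort(key=lambda r: r.mapq, reverse=True)
        dupes.update(r.name for r in group[1:])   # all but the best are duplicates
    return dupes

reads = [
    Read("r1", "chr1", 1000, "ACGTTG", 60),
    Read("r2", "chr1", 1000, "ACGTTG", 40),   # PCR duplicate of r1
    Read("r3", "chr1", 1000, "TTGACA", 60),   # same locus, different molecule
]
print(mark_duplicates(reads))                 # {'r2'}
```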

In addition to removing duplicate reads, the GATK best practices pipeline suggests adjusting the base quality scores before variant calling.37,38 GATK implements base quality score recalibration (BQSR, via its BaseRecalibrator and ApplyBQSR tools), which models and corrects known systematic errors in the reported quality scores.

With the final set of aligned reads, we move to identifying deviations (variants) from the reference sequence. Numerous tools exist to perform variant calling; I will discuss the general approaches to calling the different types of variants, highlighting commonly-used algorithms.

Calling single base substitutions – single nucleotide variants (SNVs) – relies fundamentally on counting alleles at each locus. At minimum, the statistical models incorporate the quality of each base call and assumptions about sequencing error rates, e.g. the samtools mpileup/bcftools call programs.39 GATK previously provided a similar tool implementing a simple Bayesian genotype likelihood model,37,38,40 but has since moved to a haplotype-based calling algorithm (HaplotypeCaller).41 HaplotypeCaller works by (1) identifying “active” regions containing plausible variants, (2) building possible haplotypes in the active regions using de Bruijn-like graphs, (3) assigning haplotype likelihoods to reads, and (4) calculating genotype likelihoods incorporating the estimated haplotype information. The idea of using haplotype estimates in genotype calling originated with the freebayes algorithm.42 The tools above all use very similar approaches to call small insertions and deletions (indels). Development in SNV/indel identification continues actively, and performance between algorithms predictably differs depending on the nature of a variant.43,44
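The counting-based genotype likelihood calculation can be written out for a single biallelic site; the sketch below assumes independent reads and a uniform per-base error rate, and is a simplification rather than the exact model of any particular caller:

```python
from math import comb, log10

def genotype_likelihoods(ref_reads: int, alt_reads: int, error: float = 0.01):
    """Log10 likelihoods of the observed read counts under three diploid genotypes.

    Assumes reads are independent and mis-called with probability `error`;
    P(alt read) is `error` for hom-ref, 0.5 for het, and 1 - error for hom-alt.
    """
    n = ref_reads + alt_reads
    likelihoods = {}
    for genotype, p_alt in (("0/0", error), ("0/1", 0.5), ("1/1", 1 - error)):
        p = comb(n, alt_reads) * p_alt**alt_reads * (1 - p_alt)**ref_reads
        likelihoods[genotype] = log10(p)
    return likelihoods

# A 30x site with 14 alt-supporting reads: the heterozygous genotype dominates.
print(genotype_likelihoods(ref_reads=16, alt_reads=14))
```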

Calling larger structural variation from short-read sequencing poses a greater challenge. SNVs and indels exist within single reads; therefore, we can view and count them directly. We cannot directly view variation spanning lengths greater than our read (or read pair) length. To identify larger variation, calling algorithms attempt to identify some combination of the following signals: (1) relative changes in sequencing depth (read depth); (2) paired read insert size and orientation (paired-end mapping); and (3) more rarely, loss of heterozygosity.

Read-depth methods, e.g. CNVnator,45 work by building statistical models of the relative sequencing depth across the genome. The depth bias introduced by the capture step in targeted sequencing necessitates comparing against a set of control samples, e.g. ExomeDepth,46 rather than calculating the relative depth across the genome. Paired-end mapping methods identify sets of reads with insert sizes outside a specified range, indicating insertions or deletions, and reads with incorrect orientation, suggesting genomic rearrangements.47 The Lumpy algorithm48 utilizes both read-depth and paired-end mapping signals for greater detection sensitivity. The ERDS algorithm49 combines read depth information with allele ratios when possible.
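A toy read-depth sketch (fixed windows, each sample normalized by its own median depth, arbitrary thresholds) illustrates the core idea behind depth-based CNV calling:

```python
from statistics import median

def call_depth_cnvs(sample_depth, control_depth, del_ratio=0.6, dup_ratio=1.4):
    """Flag windows whose normalized depth ratio suggests a deletion or duplication.

    `sample_depth` and `control_depth` are per-window mean depths; each sample
    is scaled by its own median to remove library-size differences.
    """
    s_med, c_med = median(sample_depth), median(control_depth)
    calls = []
    for i, (s, c) in enumerate(zip(sample_depth, control_depth)):
        ratio = (s / s_med) / (c / c_med)
        if ratio < del_ratio:
            calls.append((i, "deletion", round(ratio, 2)))
        elif ratio > dup_ratio:
            calls.append((i, "duplication", round(ratio, 2)))
    return calls

# Windows 2-3 look heterozygously deleted; window 5 looks duplicated.
sample  = [98, 102, 51, 49, 100, 150, 97]
control = [100, 100, 100, 100, 100, 100, 100]
print(call_depth_cnvs(sample, control))
```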

Sequencing an individual reveals millions of variants relative to the current reference genome,50 often requiring filtering to identify meaningful results. Multiple public databases now catalog known variants: dbSNP with SNVs and indels,51 dbGaP52 with variants linked to phenotypes, ClinVar53 with clinical variant interpretations, Ensembl,54,55 which aggregates data from many sources and provides additional analysis tools (e.g. the Variant Effect Predictor),56 and gnomAD57 with variants from >100,000 human exome sequences and >15,000 human genome sequences across diverse populations. Most commonly we begin by searching the called variants for known pathogenic variants that explain the clinical picture. If that search fails, we can proceed by filtering out common variants and prioritizing predicted protein-altering variants that correlate with the clinical presentation. Many tools exist for predicting variant consequences, e.g., the Ensembl Variant Effect Predictor,56 PolyPhen,58 and Jannovar.59
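A minimal filtering sketch over annotated VCF text illustrates this prioritization; the AF and CSQ INFO fields and the file name are hypothetical placeholders for whatever annotation tool is actually used:

```python
# Consequence terms follow the Sequence Ontology vocabulary used by most annotators.
DAMAGING = {"missense_variant", "stop_gained", "frameshift_variant", "splice_donor_variant"}

def rare_damaging(vcf_lines, max_af=0.001):
    """Yield VCF data lines with population AF below max_af and a damaging consequence.

    Assumes the INFO column carries 'AF=' (population allele frequency) and
    'CSQ=' (predicted consequence) annotations added by an upstream annotator.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue                                  # skip header lines
        info = dict(
            field.split("=", 1)
            for field in line.rstrip().split("\t")[7].split(";")
            if "=" in field
        )
        af = float(info.get("AF", 0.0))
        csq = info.get("CSQ", "")
        if af <= max_af and any(term in csq for term in DAMAGING):
            yield line.rstrip()

with open("sample.annotated.vcf") as vcf:   # hypothetical annotated VCF
    for candidate in rare_damaging(vcf):
        print(candidate)
```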

1.4 Development of Medical Genetics

The field of medical genetics arguably began in the first years of the 20th century with Archibald Garrod. In a collaborative effort with William Bateson, the person most responsible for the resurgence of Mendel’s work,60 Garrod first identified a disease (alkaptonuria) that follows a Mendelian inheritance pattern.61 Garrod, one of the early pioneers of biochemistry, went on to characterize the “inborn errors of metabolism” as enzymatic deficiencies in a book bearing the same name62 and, in 1931, correctly hypothesized that our “individual chemistries” derive from the “chromosomes from which we sprang.”63

During the first half of the 20th century, medical genetics focused on identifying and describing genetic diseases in humans, with nearly all of the molecular and basic science taking place in fruit flies, mice, and corn.60,64 During this time, Bateson hotly debated Francis Galton (the originator of eugenics) and Karl Pearson on “Mendelism” versus “biometrics.”60 In 1918, R.A. Fisher showed that Mendelian inheritance could produce the spectrum of variation described by the biometrics proponents in the landmark paper “The Correlation between Relatives on the Supposition of Mendelian Inheritance.”65 Though rightly overshadowed by the horrific legacy of eugenics, the Treasury of Human Inheritance, published in multiple volumes between 1923 and 1958 by Pearson and the Galton Laboratory for National Eugenics, described dozens of genetic disorders. Of note, the physician-mathematician Julia Bell contributed more monographs to the volumes than any other individual.

Progress in medical genetics accelerated quickly in the second half of the century with the development of cytogenetics and molecular biology/genetics. Flemming first visualized chromosomes in 1878,6 but we did not know the correct chromosome number in humans until 1956.16 Advances in cell culture and the use of colchicine66 allowed the first accurate pictures of human chromosomes, leading to the identification of numerous aneuploidy syndromes. Then, in 1970 Caspersson et al. developed a fluorescent technique for banding chromosomes,67 with G-banding developed shortly after by Marina Seabright.68 The ease of G-banding brought it into clinical use and greatly expanded cytogenetics’ utility in both gene mapping and diagnostics. In 1990, Fan et al. further expanded cytogenetics’ utility with the development of fluorescence in situ hybridization (FISH).69

In parallel with cytogenetic advancements, the evolution of molecular biology led to the identification of specific genes and proteins. Major advancements included the isolation of restriction enzymes,70 DNA hybridization and the Southern blot,71 the cloning of the first human gene,72 the use of restriction fragment length polymorphisms to map genes,73 and the invention of the polymerase chain reaction (PCR) for amplifying DNA without complicated bacterial cultures.74 This era of molecular genetics culminated with the publication of the human genome in 2001, using combinations of first- and second-generation DNA sequencing (discussed in 1.3).

Medical genetics, clinically, has most impacted pediatrics and obstetrics by implementing screening for metabolic conditions and prenatal genetic testing.

Metabolic screening programs started in 1963 with phenylketonuria (PKU).75 Fölling discovered PKU in 1934 and later confirmed its autosomal recessive inheritance in 1945.60 In 1953 Jervis identified the specific enzymatic deficiency,76 and later that year Bickel et al. demonstrated the effectiveness of a phenylalanine-restricted diet in treating PKU.77 The PKU model was extended to other inborn errors of metabolism, and all children in the US are now screened at birth for a panel of treatable metabolic conditions.

Prenatal genetics began in 1956 when Fuchs demonstrated fetal sex identification from amniocentesis.78 A decade later Steele et al. began culturing amniocytes derived from amniocentesis, opening the door to wide-ranging prenatal genetic testing.79 We now perform the full gamut of available genetic testing early in pregnancy, either by amniocentesis or chorionic villus sampling. In addition to direct fetal sampling, researchers have identified circulating fetal lymphocytes,80 trophoblasts,81 and fetal nucleated red blood cells82 in pregnant mothers’ circulation. In 1997 Dennis Lo et al. identified cell-free fetal DNA (cfDNA) in maternal circulation,83 which has so far proven more diagnostically exploitable than isolating fetal cells. We can now detect many genetic disorders from cell-free fetal DNA.84

1.5 Rationale

The work presented here addresses two issues at the forefront of medical genetics: (1) we still have very limited information on the prevalence and clinical significance of small exon-level copy number variants; (2) no one has adequately performed fetal genotyping solely from maternal cell-free DNA.

Due to both cost and the overwhelming number of variants of unknown significance obtained with genome sequencing, most clinical sequencing efforts focus on targeted sequencing (either whole-exome or specific gene panels). Targeted sequencing introduces large exon-to-exon read depth variability, making CNV detection from targeted data difficult. The majority of read-depth variability comes from the differential efficiency of the oligonucleotide baits used to capture specific genes or regions. I show that multiplexing the capture step in targeted sequencing greatly reduces sample-to-sample read depth variance, increasing the power to detect copy number variation. Additionally, I present a novel statistical framework and R package for estimating copy number from multiplexed capture data.

Despite great advances in non-invasive prenatal testing, noninvasive fetal genotyping without additional parental sequencing remains elusive. Others have suggested the possibility of noninvasive exome sequencing. I describe the statistical limitations impeding noninvasive fetal genotyping and demonstrate why noninvasive exome sequencing does not make sense clinically. I also present a novel algorithm implementing an empirical Bayesian approach to estimating the fetal fraction and maternal-fetal genotypes.