Estimating sequencing error rates
Our method estimates nine different error rates for each individual, as shown in Fig. 11. Family data allows us to detect some sequencing errors because they produce nonMendelian observations in the family, as shown in Fig. 1. By modelling the frequency of these nonMendelian observations, we can estimate perindividual error distributions and estimate the total number of sequencing errors in the dataset.
Let \(C_{g}^{(i)}\) be a random variable representing the observed variant call for individual i at a biallelic site with groundtruth genotype g∈{0/0, 0/1, 1/1}. Sequencing errors can cause \(C_{g}^{(i)} \neq g\), so our goal is to estimate the distribution of \(C_{g}^{(i)}\) within a genomic dataset. Specifically, we would like to estimate \(P\left (C_{g}^{(i)} = c\right)\) with c∈{0/0, 0/1, 1/1,./.} for all g, c, and i. The./. observation represents a site where the variant caller was unable to assign a genotype to the individual. By modeling these missing sites, we are able to estimate the rate of missing data for each individual while we estimate the other error rates. Here we make three main assumptions in order to simplify modelling:

1
We assume sequencing errors are rare, so \(P\left (C_{g}^{(i)} \neq g\right)\) is very small.

2
We assume that all observations of Mendelian errors in a family are the result of sequencing error. This may not be true in the case of de novo variants or variants falling within inherited deletions, duplications, or other structural variants. However, we expect this assumption to hold over the majority of the genome.

3
We assume each sequencing error occurs independently in different family members, so the chance of observing multiple sequencing errors at the same site within the same family is vanishingly small. This may not be true in repetitive or otherwise hardtosequence regions, but we expect these special cases to be infrequent.
We define a family genotype as a tuple of genotypes, representing the genotypes of a mother, father, and their child(ren), respectively, at a given site. For example (0/0, 0/1, 0/1, 0/0) is a family genotype for a family of four where the mother is homozygous reference, father heterozygous, first child heterozygous, and second child homozygous reference. Some family genotypes are valid, meaning they contain no missing genotypes and obey Mendelian inheritance. Let \(\mathcal {V}\) represent the set of valid family genotypes and let \(\mathcal {W}\) represent the set of invalid family genotypes. For example, (0/0, 0/1, 0/1, 0/0) is valid. However, (0/0, 0/0, 0/1, 0/0) is invalid because both parents are homozygous reference, but one of the children has a variant.
We can represent any sequencing dataset as a set of family genotypes. Let x_{j} represent the groundtruth number of occurrences of family genotype j, if we could sequence perfectly without any sequencing error or missing data. We do not have access to x_{j}. Instead, we have access to y_{j}, the number of times we observe family genotype j in our dataset, in the presence of sequencing error and missing data. Since we assume that all sites obey Mendelian inheritance, for all invalid family genotypes \(w \in \mathcal {W}, x_{w} = 0\). However sequencing error may cause y_{w}>0.
Let p_{v→w} represent the probability that sequencing errors cause valid family genotype v to be observed as invalid family genotype w. We model Y_{w}, a random variable representing the number of times we observe the invalid family genotype w, using Y_{w} to denote a random variable and lowercase y_{w} to denote a realization of that random variable (in this case, our observations). Assuming sequencing errors are rare, we can apply a generalization of Le Cam’s theorem [27] to show that the Y_{w}s, as sums of multinomials, are approximately distributed as independent Poissons.
$$ Y_{w} \sim \text{Pois}\left(\sum\limits_{v \in \mathcal{V}} x_{v} p_{v \rightarrow w} \right) $$
The error of the approximation is bounded by \(2\sum _{v \in \mathcal {V}} x_{v} \delta _{v}^{2}\) where δ_{v} is the probability of a sequencing error occurring at a site with family genotype v. Since sequencing errors are rare, we expect δ_{v} to be very small for all v, so the approximation is quite good.
We would like to use our Poisson approximation to develop a maximum likelihood estimate for each \(P\left (C_{g}^{(i)} = c\right)\). Since we assume that the chance of multiple errors occurring at the same site within the same family is vanishingly small, p_{v→w}≠0 only if v and w differ for only a single family member. In this case, we call v and w neighbors. Every pair of neighboring genotypes has a corresponding \(P\left (C_{g}^{(i)}=c\right)\) where i is the index of the family member that has different genotypes in v and w, g is the genotype of family member i in v, and c is the genotype of family member i in w. For example, family genotype (0/0,0/0,0/1,0/0) has only three valid neighbors: (0/0,0/1,0/1,0/0),(0/0,0/0,0/0,0/0), and (0/1,0/0,0/1,0/0). Y_{(0/0,0/0,0/1,0/0)} is therefore distributed as:
$$\begin{array}{*{20}l} \text{Pois} \left(\left[\begin{array}{ccc} x_{(0/0, 0/1, 0/1, 0/0)} \\ x_{(0/0, 0/0, 0/0, 0/0)} \\ x_{(0/1, 0/0, 0/1, 0/0)} \\ \end{array}\right]^{T} \left[\begin{array}{cccc} P\left(C_{0/1}^{(0)}=0/0\right) \\ P\left(C_{0/1}^{(1)}=0/0\right) \\ P\left(C_{0/0}^{(2)}=0/1\right) \\ \end{array}\right] \right) \end{array} $$
We do not have access to x_{v}, the ground truth number of occurrences of valid family genotype v. However, since sequencing errors are rare, we assume most valid family genotypes are observed correctly, so we can use y_{v} as an approximation of x_{v}. Since our model is linear in the parameters of interest, Poisson regression will produce a maximum likelihood estimate of each \(P\left (C_{g}^{(i)}=c\right)\).
Limitations on estimating error rates in parents
Certain sequencing errors in parents will never produce an invalid family genotype. For example, if we want to understand the probability of observing a heterozygous variant call in a parent when the underlying genotype is homozygous reference, our method immediately runs up against a problem. This type of 0/0→0/1 error in a parent will never result in an invalid family call, because regardless of whether the parent is heterozygous or homozygous alternate, all of her children may inherit the reference allele. Our method therefore cannot be used to estimate 0/0→0/1 or 1/1→0/1 errors in parents. However, it can estimate these error rates for children. Throughout the paper, we report error rate distributions in children only.
Estimating the expected numbers of errors
Once we have an estimate of the probability of a particular type of sequencing error, we can calculate the expected number of errors of this type. Let \(P\left (C_{g}^{(i)}=c\right)\) represent the rate of observing variant call c at a site with genotype g in individual i. Let \(W_{g \rightarrow c}^{(i)}\) represent the number of times individual i was observed to have variant call c at a site with true genotype g, then
$$\begin{array}{*{20}l} E\left[W_{g \rightarrow c}^{(i)}\right] = x_{g}^{(i)} P\left(C_{g}^{(i)}=c\right) \end{array} $$
where \(x_{g}^{(i)}\) is a count of the number of sequenced sites where individual i has genotype g (for example \(x_{0/0}^{(i)}\) represents the number of sites where individual i is homozygous reference). Our data contains sequencing errors, so we do not know \(x_{g}^{(i)}\), but since we expect error rates to be small, we can use the number of times we observe individual i to have genotype g as a good estimate.
Estimating precision and recall
Precision is the fraction of observed variants that are real, calculated with \(\frac {TP}{TP+FP}\) (where TP represents true positives and FP represents false positives). Recall is the fraction of real variants that are observed, calculated with \(\frac {TP}{TP+FN}\) (where TP represents true positives and FN represents false negatives). We can use these formulas along with our estimates of expected number of errors \(\left (W_{g \rightarrow c}^{(i)}\right)\) to estimate precision and recall at heterozygous and homozygous alternate sites for each individual i.
$$\begin{array}{@{}rcl@{}} \text{Precision}^{(i)}_{0/1} &=& \frac{E\left[W_{0/1 \rightarrow 0/1}^{(i)}\right]}{\sum_{g \in G} E\left[W_{g \rightarrow 0/1}^{(i)}\right]} \\ \text{Precision}^{(i)}_{1/1} &=& \frac{E\left[W_{1/1 \rightarrow 1/1}^{(i)}\right]}{\sum_{g \in G} E\left[W_{g \rightarrow 1/1}^{(i)}\right]} \\ \text{Recall}^{(i)}_{0/1} &=& \frac{E\left[W_{0/1 \rightarrow 0/1}^{(i)}\right]}{\sum_{c \in C} E\left[W_{0/1 \rightarrow c}^{(i)}\right]} \\ \text{Recall}^{(i)}_{1/1} &=& \frac{E\left[W_{1/1 \rightarrow 1/1}^{(i)}\right]}{\sum_{c \in C} E\left[W_{1/1 \rightarrow c}^{(i)}\right]} \\ \end{array} $$
where G={0/0, 0/1, 1/1} is the set of possible true genotypes and C={0/0, 0/1, 1/1./.} is the set of possible variant call observations.
Sequencing dataset details
Several sequencing datasets were used throughout the paper. Here we provide detailed information on the sequencing pipelines used to generate this data.

iHART WGS Whole genome sequencing data from iHART [28], a dataset of multiplex autism families, containing 886 families and 3,943 individuals. Individuals were sequenced at 30x coverage using Illumina’s TruSeq Nano library kits, reads were aligned to build GRCh38 of the reference genome using bwamem, and variants were called using GATKv3.4. Only biallelic variants that pass GATK’s Variant Quality Score Recalibration (VQSR) were included in analysis. This dataset contains 63,206,842 informative sites, where an informative site is defined as a biallelic SNP where at least one individual in the dataset has a nonreference genotype.

iHART WGS v3.2 Whole genome sequencing data from iHART, sequenced as described for iHART WGS. Reads were aligned to build GRCh37 of the reference genome using bwamem, and variants were called using GATKv3.2. Only biallelic variants that pass GATK’s Variant Quality Score Recalibration (VQSR) were included in analysis. This dataset contains 75 sets of twins which are used to validate our algorithm. This dataset contains 47,890,555 informative sites.

iHART WGS v3.4 Whole genome sequencing data from iHART, sequenced as described for iHART WGS. Reads were aligned to build GRCh37 of the reference genome using bwamem, and variants were called using GATKv3.4. Only biallelic variants that pass GATK’s Variant Quality Score Recalibration (VQSR) were included in analysis. This dataset contains 68,963,585 informative sites.

SSC WGS Whole genome sequencing data from SSC [29], a dataset of simplex autism families containing 519 families and 2075 individuals. Individuals were sequenced at 30x coverage using an Illumina PCRfree library protocol, reads were aligned to build GRCh38 of the reference genome using bwamem version 0.7.8 and variants were called using GATKv3.5. Only biallelic variants that pass VQSR were including in analysis. This dataset contains 44,469,152 informative sites.

SPARK WES Whole exome sequencing data from SPARK [30], a crowdsourced dataset of simplex and multiplex autism families, containing 5,903 families and 13,906 individuals. Exome capture was performed using VCRome. Samples were sequenced at 20x coverage. Reads were aligned to build GRCh37 of the reference genome with bwa and variants were called using both FreeBayes and GATK. Only biallelic variants within the VCRome target regions with allele quality greater than 30 were included in analysis. This dataset contains 47 sets of twins which are used to validate our algorithm. This dataset contains 2,511,175 informative sites.

SPARK Array Illumina HumanCoreExome microarray data for 550K SNP sites from SPARK [31], containing 3,239 families and 13,248 individuals. This dataset contains 47 sets of twins which are used to validate our algorithm. This dataset contains 588,046 informative sites.
Validating sequencing error rate estimates using monozygotic twins
We represent a pair of twins as individuals A and B. Let M_{a,b} be a random variable representing the number of observations where twin A has variant call a and twin B has variant call b such that a≠b and a,b∈{0/0, 0/1, 1/1,./.}. In order to model M_{a,b}, we make a couple of assumptions:

1
We assume that the probability of observing sequencing errors in different family members at the same site is very small.

2
We assume that mismatches between twin pair A and B are caused by sequencing errors. While de novo mutations may also cause mismatches, the de novo mutation rate is around 10^{−8} per generation [32], and the current sequencing error rates are closer to 10^{−5} [20]. Thus we assume that most mismatches between twins are due to sequencing errors, not to de novo mutations.
Under these assumptions, we can model the expected number of M_{a,b} mismatches as the expected number of times twin A has a b→a error plus the expected number of times twin B has a a→b error.
$$\begin{array}{*{20}l} E\left[M_{a, b}\right] = E\left[W_{b \rightarrow a}^{(A)}\right] + E\left[W_{a \rightarrow b}^{(B)}\right] \end{array} $$
We then compare these estimates to the observed number of mismatches between twin pair A and B in our dataset. Our method relies on family data to estimate error rates, so to produce as fair a comparison as possible, we estimate error rates for each twin separately, using only nontwin family members.
WGS variant calling in lowcomplexity regions
Nextgen sequencing is known to struggle in lowcomplexity regions (LCR). In order to examine performance in these regions as compared to the rest of the genome, we used the lowcomplexity regions described by [20] and generated by the mdust program. In GRCh37, 2.0% of the genome is considered lowcomplexity. We considered all other genomic regions to be highcomplexity regions (HCR). We estimated variant calling performance in both the LCR and HCR by restricting our method to only consider variants within LCRs or HCRs respectively, using the same set of samples.
WGS versus microarrays at diseaseassociated sites
To investigate performance at diseaseassociated SNPs, we estimated error rates for the WGS datasets restricted to sites included in GWAS Catalog [33]. We used liftover from the UCSC Genome Browser [34] to transfer GWAS Catalog sites from grch38 coordiantes to grch37 coordinates. We then compared the performance of our WGS datasets at these sites to the performance of our microarray datasets.