LoFTK: a framework for fully automated calculation of predicted Loss-of-Function variants and genes
BioData Mining volume 16, Article number: 3 (2023)
Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for detecting knockout genes caused by compound heterozygous (CH) LoF variants.
We have developed the Loss-of-Function ToolKit (LoFTK), which allows efficient and automated prediction of LoF variants from genotyped, imputed and sequenced genomes. LoFTK enables the identification of genes that are inactive in one or two copies and provides summary statistics for downstream analyses. LoFTK can identify CH LoF variants, which result in LoF genes with two copies lost. Using data from parents and offspring we show that 96% of CH LoF genes predicted by LoFTK in the offspring have the respective alleles donated by each parent.
LoFTK is a command-line based tool that provides a reliable computational workflow for predicting LoF variants from genotyped and sequenced genomes, identifying genes that are inactive in 1 or 2 copies. LoFTK is an open software and is freely available to non-commercial users at https://github.com/CirculatoryHealth/LoFTK.
Loss-of-function (LoF) variants are determined to have a critical effect on gene function by inactivating protein-coding genes . Remarkably, recent analyses of the human genome have uncovered that individuals harbor many dozens of LoF variants, including stop-gained, frameshift variants and splice site disruptions [2, 3]. On average, LoF variants are deleterious, and thus usually tend to be found at very low frequencies in the human population. These variants can have a profound impact on the gene transcripts and translated proteins. The association of LoF variants with complex diseases and phenotypic traits may lead to the discovery and validation of novel therapeutic targets . However, hundreds of LoF variants are functionally neutral with no detectable influence on phenotypes [4, 5].
Several difficulties emerge when evaluating LoFs on a broad scale. False positives in the prediction of LoF variants can arise due to artifacts that may occur during genotype calling, mapping, imputation and annotation . To annotate high-confidence (HC) LoF variants only, Loss-Of-Function Transcript Effect Estimator (LOFTEE)  can be used. LOFTEE is a plugin implemented in the Ensembl Variant Effect Predictor (VEP)  that imposes stringent filtering criteria to annotate HC LoF variants, eliminates nonsense mutations that are unlikely to impact protein function, and excludes LoF variants that are enriched with annotation artifacts.
However, LoF variants discovery can also be used to predict single-copy losses (heterozygous LoF variants) that inactivate a single copy of a gene, or two-copy losses that completely knockout a gene. Two-copy losses can be caused by homozygous and compound heterozygous (CH) LoF variants. CH variants appear when parents both donate a LoF-causing allele that locates at different loci in the same gene . There is mounting evidence that CH LoF variants have a role in complex diseases. For example, both homozygous and CH LoF variants have been found to increase the risk of autism spectrum disorder [9, 10].
Current tools, such as LOFTEE and ALoFT, only annotate LoF variants and provide variant-level output [6, 11]. They do not identify genes and distinguish between single-copy and two-copy loss genes. Furthermore, the collection of available tools to identify and annotate LoF variants require in-depth computational skills impeding the usage by scientists less skilled in bioinformatics. As far as we are aware, no user-friendly, automated bioinformatics pipeline exists to identify CH LoF variants, and single-copy and two-copy LoF genes, and that also provides the necessary input for downstream (association) analyses. The development of a bioinformatics pipeline that automatically parses the VEP-LOFTEE result files in a single workflow to the input required for downstream analyses, would democratize the use of LoF data to a broader range of biomedical scientists that would only require limited bioinformatic skills.
Here we present an open source tool, the Loss-of-Function ToolKit (LoFTK), which allows efficient and automated prediction of LoF variants and identifies genes that are inactive in one or two copies using genetic data derived through array-based genotyping imputed or whole-genome sequencing. LoFTK analyzes and parses genetic data in four steps as explained in the Implementation and depicted in Fig. 1; 1) Annotation of HC LoF variants from large-scale sequencing and array-based data using VEP and LOFTEE; 2) Identification of one-copy loss and two-copy loss of genes by parsing the CSQ field for each HC LoF variant in the VCF file as generated by VEP-LOFTEE; 3) Generation of a summary data for LoF-wide association analyses; and 4) Creating a statistical report describing the total number of LoF variants (homozygous, heterozygous and CH), LoF genes (single-copy and two-copy loss), and the average, minimum and maximum numbers of LoF variants and genes per sample. LoFTK aids to bridge the divide between computational scientists and wet-lab based trained biomedical scientists by simplifying the processing of VCF-based data to a useful format for downstream analyses in statistical tools like R.
The LoFTK workflow consists of 4 analytical steps visualized in Fig. 1 and described below.
Preprocessing: conversion of IMPUTE2 to VCF
The first step depends on the input data formats. Two common file formats are permitted as inputs; IMPUTE2 [12, 13] output format and the Variant Call Format (VCF). The input data has to contain phased genotypes for distinguishing compound heterozygotes from two variants on the same allele. LoFTK uses two quality metrics for imputed genotypes: the imputation quality (info score) and imputed allele probability. The imputation quality contains values between 0 and 1, where higher values mean that a variant has been imputed with more certainty. Besides, imputation methods generate a probabilistic prediction of the missing genotypes, which stands for the likelihood of carrying genotypes combinations of A/A, A/B, and B/B in a particular individual. The supreme estimated genotype is the genotype that has the highest likelihood of being correct . LoFTK has cut-off options to filter based on the optimal imputation quality metrics (Supplementary Material). After filtering, IMPUTE2 files are converted to VCF files. The VCF files that are generated from IMPUTE2 files or introduced directly by the user are applied as an input to the next step.
The second step consists of annotation of LoF variants using VEP and LOFTEE. LOFTEE utilizes the Ensembl API framework to annotate HC LoF variants. LoFTK is highly customizable, with the ability to change VEP and LOFTEE flags in a configuration file. We designed LoFTK to be capable of processing data with Homo sapiens (human) genome assemblies GRCh37 and GRCh38, and it can easily be upgraded to future genome builds. The VEP will return results as VEP VCF format, which is similar to the input VCF, but in addition shows LoF information in the INFO field, such as LoF flags and LoF filtering outcome (high-confidence or low-confidence).
Calculation of LoF variants and genes
From the VEP output, the HC LoF variants are filtered, followed by parallel determination of homozygous and heterozygous LoF variants (Table 2) and allele frequencies, as well as the copy number loss (single-copy or two-copy) of LoF genes (Table 1). LoFTK recognizes CH LoF variants, which result in LoF genes with two copies losses. The LoF genes are extracted by parsing the CSQ field for each HC LoF variant in the VCF file that produced from the VEP. Optionally, LoFTK can be used to determine ‘mismatched genes’ between samples; these are genes that are active in one or two copies in one sample and completely inactive in the other sample. This feature helps study interactions between human genomes, for instance during pregnancy (maternal vs fetal genome) and after stem cell or solid organ transplantation (donor vs recipient genome).
Finally, descriptive statistics of LoF variants are calculated, such as the total number of LoF variants, number of single-copy and two-copy LoF genes, and median of LoF variants per participant.
Imputation quality threshold
The imputed genotype data provides two quality metrics: the INFO score and the imputed alleles probability. UK Biobank (UKBB) was used as the gold standard for determining the optimal quality metrics for obtaining the most genuine LoF variants from imputed genotypes data. We retrieved whole exome sequencing (WES) and array genotypes data from 4,476 randomly selected UKBB participants. Both data were phased using SHAPEIT2  and array genotypes were imputed by IMPUTE2 [12, 13]. A combined reference panel from the 1000 Genome project phase 3  and Genome of the Netherlands (GoNL) study  was used for phasing and imputation. We used LoFTK for LoF analysis in phased WES and three datasets of variants in imputed genotypes data. These subsets were divided based on variants with INFO scores above: 0.3, 0.6, 0.9. for each individual, predicted LoF variants in the WES were compared to LoF variants in each subset with considering the imputed allele probabilities ranging from 0.01 to 0.1 for that variant (Supplementary Fig. 1), in order to count the number of false negatives (average of LoF variants predicted in WES data but not in imputed data) and false positives (average of LoF variants predicted in imputed data but not in WES data).
Validity of predicted CH LoF in trios
LoFTK is capable of annotating CH LoF variants, which introduce two inactive copies of a gene. To confirm the transmission of genuine CH LoF variants from parents to probands, we used trio-family genotype data from the Genome of the Netherlands (GoNL) cohort (Illumina Immunochip microarray SNP data) . We performed a quality control step as preprocessing filtrations to impute genotypes data (Supplementary Material). We used the TOPMed imputation server  to impute untyped variants in 760 individuals from 250 families. LoFTK predicted LoF variants from imputed genotypes and we investigated transmission of CH LoF variants from parents to offspring.
Exome sequences in UKBB
UKBB data were made available under the North West Multi-centre Research Ethics Committee (reference 11/NW/0382). UKBB data used in this study were obtained under application number 24711. We applied LoFTK on exome sequences of 200,643 UKB participants that were released in October 2020 . We filtered participants exome data restricted to unrelated homogeneous white British population (Field IDs: 22,018 and 22,006) and removed singleton variants (MAC < 2). We phased exomes genotypes of 166,991 homogeneous white British participants using Eagle2  followed by LoFTK analysis in order to identify CH knockout genes. Several genes were selected as positive controls that known to be associated with specific traits in UKBB . We tested the association between these LoF genes and traits using linear regression for quantitative data and logistic regression for binary data with age, sex, and principal components 1–16.
Results and discussion
LoFTK is a command-line tool that provides a robust computational workflow pipeline for predicting LoF variants from array-based (genotyped or imputed) and sequenced genomes, discovering genes that are inactive in 1 or 2 copies. LoFTK was developed using Perl and BASH scripting languages which make the code easily understandable, modifiable and extendable when needed. Instructions on how to install and run LoFTK as well as example datasets are publicly available at https://github.com/CirculatoryHealth/LoFTK. The code-setup of LoFTK is such that it is highly customizable through options and directories settings explained in the LoF.config file and GitHub README. It is designed to run as a command line program with user-friendly flags, which helps non-experts users to get quickly familiarized. LoFTK requires pre-installed tools, including BASH and Perl (> = version 5.10.1) which are commonly installed on Linux-based system, and the more specialized tools Ensembl VEP (https://github.com/Ensembl/ensembl-vep) and LOFTEE (https://github.com/konradjk/loftee) which both come with extensive installation documentation. We tested LoFTK on a computer cluster using CentOS 7 and managed by SLURM or SGE.
Generation of LoF variants and genes
LoFTK uses the information present in large-scale sequencing and genotyping data to generate four files; two matrices of LoF variants and their respective genes, a list of LoF variants allele frequencies, and a report with descriptive statistics on the variants and genes. In the LoF variants matrix, the variants are represented as rows, and individuals are represented as columns. Each matrix's cell contains a number that represents the homozygous or heterozygous status of a given LoF variant for a given individual as shown in Table 1. Similarly, the columns in the LoF genes matrix define individuals except the rows represent the LoF genes, and each number in the matrix cell indicates that either the gene has no copy loss (0), single-copy loss (1) or two-copy loss (2) (Table 2). The frequencies in both matrices represent the frequency of heterozygous and homozygous LoF variants among the provided samples (Table 1), as well as the frequency of one-copy and two-copy LoF genes (Table 2). Finally, LoFTK generates information file with “.info” extension to show descriptive statistical report for predicted LoF variants and genes, such as the total LoF variants and genes, total heterozygous and homozygous LoF variants, total single-copy and two-copy LoF genes, and median of LoF variants and genes per participant.
Imputation quality cut-off points
We assessed imputation quality metrics for obtaining the most genuine LoF variants in imputed genotype data by comparing existence of each predicted LoF variant between WES and three imputed datasets (INFO > 0.3, 0.6, 0.9) with considering the imputed allele probability cutoffs between 0.01 to 0.1 (see Sect. 2.2.).
LoFTK analysis for imputed dataset with INFO > 0.9 shows an optimal prediction of true LoF variants, because it has less false positive 2-copy LoF variants compared to the others (0.3 and 0.6) (Supplementary Fig. 1). However, choosing an optimal imputed allele probability was difficult due to the lack of apparent variations.
CH LOF variants in trios
CH LoF variants occur when both parents donate a single LoF allele to proband at distinct loci within the same gene. We used trio-families from the GoNL  to evaluate the accuracy of obtaining two inactive copies in genes caused by CH LoF variants (see Sect. 2.3.).
We predicted LoF variants and genes in 250 families' imputed genotypes (760 individuals). We found 642 LoF variants affecting 571 genes (Table 3). In 164 probands, we identified 250 events of CH LoF variants producing 2-copy LoF genes. There were 240 (96%) true transmissions of CH LoF in parent-offspring, whereas there were 10 false transmissions.
LoF variants and genes in ~ 200 K exomes from UKBB
We predicted LoF variants and genes in unphased exomes of 200,643 participants (mixed populations). We identified 398,377 heterozygous LoF variants affecting 17,796 genes and 2,383 homozygous LoF variants affecting 1,798 genes. Next, to determine CH variants, we phased exomes of 166,991 homogenous, unrelated white British individuals and found 16,464 1-copy LoF genes and 1,510 2-copy LoF genes. Of the 2-copy LoF genes, 481 were caused by homozygous variants only, 307 by CH variants only, and 722 by both homozygous and CH variants. To prove that we correctly identified homozygous and CH LoF genes, we examined nine LoF genes (as positive controls) that are known to be associated with specific traits. All of them showed a significant association and the expected direction of effect (Supplementary Table 2).
Some limitations of the current LoFTK version should be considered: LoFTK relies on preexisting methods for phasing, imputation, genotype calling, and variant effect prediction, which means results can be affected by errors generated by these software. Errors rate varies from one sequencing platform to another in the variant calling step, making it difficult to predetermine error rates. Lastly, predicting LoF variants and genes from unphased data will not allow the detection of CH LoF variants, which means users will have to input phased data to make full use of LoFTK.
Prediction of LoF variants and genes provide important insight into the discovery of possible disease-causing mutations and potential therapeutic targets. LoFTK is easy to use and helps users to predict LoF variants from genotyped and sequenced genomes, identifying genes that are inactive in 1 or 2 copies, and providing summary statistics report describing the total number of LoF variants, LoF genes, and their average, minimum and maximum per sample. LoFTK is highly customizable and extra features for the identification of knockout genes in copy number variation (CNV) and predicting the pathogenicity of predicted LoF variants can be easily added.
Availability of data and materials
The data used in this article were provided by the UK Biobank under application (#24,711) and the Genome of the Netherlands under application number (#2,021,217). Access to these data can be achieved by request from UK Biobank (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access) and the Genome of the Netherlands (https://www.nlgenome.nl/menu/main/app-request).
Loss-Of-Function Transcript Effect Estimator
Ensembl Variant Effect Predictor
Variant Call Format
Genome of the Netherlands
Whole exome sequencing
Copy number variation
MacArthur DG, Tyler-Smith C. Loss-of-function variants in the genomes of healthy humans. Hum Mol Genet. 2010;19:R125–30.
Balasubramanian S, Habegger L, Frankish A, MacArthur DG, Harte R, Tyler-Smith C, Harrow J, Gerstein M. Gene inactivation and its implications for annotation in the era of personal genomics. Genes Dev. 2011;25:1–10.
MacArthur DG, Balasubramanian S, Frankish A, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335:823–8.
Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.
Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–91.
Karczewski KJ, Francioli LC, Tiao G, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–43.
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, Flicek P, Cunningham F. The ensembl variant effect predictor. Genome Biol. 2016;17:122.
Kamphans T, Sabri P, Zhu N, Heinrich V, Mundlos S, Robinson PN, Parkhomchuk D, Krawitz PM. Filtering for compound heterozygous sequence variants in non-consanguineous pedigrees. PLoS ONE. 2013;8: e70151.
Yu TW, Chahrour MH, Coulter ME, et al. Using whole-exome sequencing to identify inherited causes of autism. Neuron. 2013;77:259–73.
Lim ET, Raychaudhuri S, Sanders SJ, et al. Rare complete knockouts in humans: population distribution and significant role in autism spectrum disorders. Neuron. 2013;77:235–42.
Balasubramanian S, Fu Y, Pawashe M, McGillivray P, Jin M, Liu J, Karczewski KJ, MacArthur DG, Gerstein M. Using ALoFT to determine the impact of putative loss-of-function variants in protein-coding genes. Nat Commun. 2017;8:382.
Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5: e1000529.
Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet. 2012;44:955–9.
Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11:499–511.
Delaneau O, Marchini J, Zagury J-F. A linear complexity phasing method for thousands of genomes. Nat Methods. 2011;9:179–81.
Boomsma DI, Wijmenga C, Slagboom EP, et al. The Genome of the Netherlands: design, and project goals. Eur J Hum Genet. 2014;22:221–7.
Taliun D, Harris DN, Kessler MD, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–9.
Szustakowski JD, Balasubramanian S, Kvikstad E, et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat Genet. 2021;53:942–8.
Loh P-R, Danecek P, Palamara PF, et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet. 2016;48:1443–8.
Backman JD, Li AH, Marcketta A, et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature. 2021;599:628–34.
This study was carried out utilizing the UK Biobank Resource under application number . This study makes use of data generated by the Genome of the Netherlands Project. Funding for the project was provided by the Netherlands Organization for Scientific Research under award number , dated (July 9, 2009) and made available as a Rainbow Project of the Biobanking and Biomolecular Research Infrastructure Netherlands (BBMRI-NL). Samples where contributed by LifeLines (http://lifelines.nl/lifelines-research/general), The Leiden Longevity Study (http://www.healthy-ageing.nl; http://www.langleven.net), The Netherlands Twin Registry (NTR: http://www.tweelingenregister. org), The Rotterdam studies, (http://www.erasmus-epidemiology.nl/rotterdamstudy) and the Genetic Research in Isolated Populations program (http://www.epib.nl/research/geneticepi/research.html#gip). The sequencing was carried out in collaboration with the Beijing Institute for Genomics (BGI). We are thankful for the support of the ERA-CVD program ‘druggable-MI-targets’ (grant number: 01KL1802), the EU H2020 TO_AITION (grant number: 848146), and the Leducq Fondation ‘PlaqOmics’.
Availability and requirements
Project name: LoFTK.
Project home page: https://github.com/CirculatoryHealth/LoFTK
Operating system(s): Linux.
Programming language(s): Bash, Perl > = 5.10.1
Other requirements: VEP, LOFTEE, samtools.
License: International Public License.
Any restrictions to use by non-academics: none.
This work was supported by National Institutes of Health [LM010098] from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking BigData@Heart grant (n° 116074). AA is supported by King Abdullah International Medical Research Center (KAIMRC). JvS is supported by Dutch Heart Foundation grants [2017T003] and [2019T045]. SWvdL is funded through grants from the Netherlands CardioVascular Research Initiative of the Netherlands Heart Foundation (CVON 2011/B019 and CVON 2017–20: Generating the best evidence-based pharmaceutical targets for atherosclerosis [GENIUS I&II]). The funders had no role in the design and conduct of this study.
Ethics approval and consent to participate
UK Biobank (UKBB) had acquired ethics approval from the North West Multi-centre Research Ethics Committee, which covers the UK (approval number: 11/NW/0382) and had obtained informed consent from all participants. UKBB data used in this study were obtained under application ID 24711. Collection of the Genome of the Netherlands (GoNL) data was approved by the committee of GoNL and all necessary participant consent has been obtained. GoNL data used in this study were obtained under application ID 2021217.
Consent for publication
All participants provided written informed consent for their anonymized information to be used for health-related research purposes and the results thereof to be published.
The authors declare that they have no competing interests. Dr. Sander W. van der Laan has received Roche funding for unrelated work.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Table 1. False positive and false negative values from matching the LoF variants between exome and subgrouped imputed genotypes data.Supplementary Table 2. Association of selected positive control LoF genes with UKBB phenotypes. Supplementary Figure 1. The workflow for achieving the optimal imputation quality measure.
About this article
Cite this article
Alasiri, A., Karczewski, K.J., Cole, B. et al. LoFTK: a framework for fully automated calculation of predicted Loss-of-Function variants and genes. BioData Mining 16, 3 (2023). https://doi.org/10.1186/s13040-023-00321-5
- Loss-of-Function variants
- Knockout genes
- Compound heterozygotes
- Human genetic