Background: Although the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) guidelines for variant interpretation are used widely in clinical genetics, there is room for improvement of these knowledge-based guidelines.
Methods: We statistically assessed the average deleteriousness of start-lost, stop-lost, and in-frame insertion and deletion (indel) variants and extracted deleterious subsets, informed by the proportions of rare variants in the general population of the Genome Aggregation Database (gnomAD). A machine learning-based model scoring the pathogenicity of start-lost variants (the PoStaL model) was constructed by predicting possible translation initiation sites on transcripts by deep learning and training a random forest on known pathogenic and likely benign variants.
Findings: The proportion of rare variants was highest in stop-lost variants, followed by in-frame indels and start-lost variants, suggesting that the criteria in the ACMG/AMP guidelines assigning PVS (pathogenic very strong) to start-lost variants and PM (pathogenic moderate) to stop-lost and in-frame indel variants would not be appropriate. Regarding deleterious subsets, stop-lost variants introducing extensions of more than 30 amino acids and in-frame indels computationally predicted to be damaging are enriched for rare and known pathogenic variants. For start-lost variants, we developed the PoStaL model, which outperforms existing tools. We also provide comprehensive lists of the PoStaL scores for start-lost variants and the length of extended amino acids by stop-lost variants.
Conclusions: Our study could contribute to refinement of the ACMG/AMP guidelines, provides resources for future investigation, and offers an example of how to improve knowledge-based frameworks by data-driven approaches.
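
As an illustration of the stop-lost extension length mentioned in the Findings, the following is a minimal sketch and not the authors' implementation. It assumes the input is the transcript sequence immediately downstream of the original stop codon, and it counts only the downstream codons read before the next in-frame stop; whether the mutated stop codon itself is counted as an added residue is an assumption that may differ from the study's definition. The function name and input are hypothetical.

```python
from typing import Optional

# Standard DNA stop codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_lost_extension_length(downstream_seq: str) -> Optional[int]:
    """Count codons read after a lost stop until the next in-frame stop.

    `downstream_seq` (hypothetical input) is the transcript sequence that
    directly follows the original stop codon. Returns the number of codons
    before the next in-frame stop codon, or None if no in-frame stop is
    found within the given sequence.
    """
    seq = downstream_seq.upper()
    # Step through the sequence one complete codon at a time, in frame
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return None
```

A variant could then be compared against the 30-amino-acid threshold reported above, with the handling of sequences that lack any in-frame stop left as a separate decision.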