Diversifying Genomes Seminar

Written by Rachel Creager

03/09/2019

Thanks to volunteer Rachel Creager for her writeup of our recent seminar!

On Friday, March 1, Rachel Sherman, a Ph.D. student at the Johns Hopkins University Center for Computational Biology, presented her work, titled “Diversifying Genomics: Identifying large variations in genomes of African ancestry individuals.” Rachel discussed the history of the Human Genome Project and the race to produce the first full sequence of a human’s DNA. Because two organizations were racing to publish the full genome before their competition, ~70% of the reference genome produced by the Human Genome Project comes from a single individual.

Though humans are thought to be ~99.9% similar, the differences at the genomic level are the most interesting and important part. Therefore, when a new genomic sequence has been collected, the first step of the analysis is to compare the DNA sequence to the reference genome. However, large leftover sequences that do not align to the reference genome are thrown out. Rachel looked at those leftover sequences from a cohort of 900 individuals of African ancestry to see whether any information could be recovered from that “trash” data. After extensive analysis to ensure that the leftover sequences contained true human DNA, Rachel found over 125,000 distinct DNA sequences that were at least 1,000 base pairs in length. In total, that is over 300 million base pairs, or 10% of the current reference genome! On average, each individual in the cohort had 859 of these distinct long insertions in their genome. Rachel concluded by asking whether we are really ~99.9% similar when such long insertions are simply thrown out with the trash of genomic analysis. This mind-blowing conclusion led to a lot of excellent conversation among the attendees. Rachel’s lab is currently working on generating more reference genomes to provide a better infrastructure for the analysis of novel next-generation sequencing data.