NSF IIS/III Improving de novo Genome Assembly using Optical Maps #1814359
News (last update: Aug 2018)
Project Goals and Research Challenges
Out of the estimated 8.7 million eukaryotic species on the planet, only a few thousand have been sequenced and assembled. While sequencing costs continue to decrease exponentially, de novo sequence assembly remains computationally challenging, particularly for large, repetitive genomes. New, cost-effective optical mapping technologies on the market (BioNano Genomics Irys and OpGen Argus) are creating opportunities to improve assembly contiguity and to reduce assembly errors (e.g., mis-joins). Our review of the literature for the major genomes released in the past two years shows that about half of these sequencing projects used an optical map to increase the contiguity of the assembly. Despite the importance of optical maps in the genome assembly pipeline, surprisingly few automated software tools allow users to take advantage of them. In fact, some steps of the pipeline (e.g., chimeric contig removal) are still carried out manually, which is tedious and error-prone.
The objective of this project is to develop innovative algorithmic solutions for automatically and accurately improving de novo genome assembly. Specifically, we want to provide user-friendly software tools that enable users to enhance assembly contiguity and reduce assembly errors (e.g., mis-joins) using optical maps. The proposed research plan is articulated around the following questions: how to take advantage of one or more optical maps i) to accurately detect and split chimeric contigs and chimeric molecules, ii) to accurately create scaffold genome assemblies, and iii) to accurately stitch multiple (redundant) genome assemblies. We also plan to devise, test and deploy a user-friendly genome browser to visually inspect multiple optical maps (BioNano IrysView only allows one optical map at a time). Our approach to answering i), ii) and iii) is to frame these tasks as combinatorial optimization problems and provide efficient algorithms that compute globally optimal solutions (or solutions with approximation guarantees).
This project will advance sequencing techniques for complex genomes. Software tools and the assembled data will be placed into the public domain, which will benefit researchers and the public worldwide, and potentially lead to new international and industrial collaborations. This project will directly support two graduate students in a highly interdisciplinary environment, building on UCR's strengths in Computer Science and Agricultural Sciences. Undergraduates will have opportunities to participate in research through a Research Experiences for Undergraduates (REU) site at UCR, a collaboration with a nearby community college, and a new US Department of Education Title V Hispanic Serving Institution grant (UCR is an accredited HSI). Young people from Riverside and San Bernardino counties will be inspired to pursue science and technology careers through demonstrations based on this project at outreach events such as the annual Bourns Science and Engineering Day. The PI will work with the new Veteran Resource Center at UC Riverside to engage veteran students in research and prepare them to apply to graduate school.
- This material is based upon work supported by the National Science Foundation under Grant No. 1814359.