UCR

Computer Science and Engineering



Stefano Lonardi, Professor

Computational Biology and Bioinformatics


2024: Comprehensive assessment of 11 de novo HiFi assemblers on complex eukaryotic genomes and metagenomes. Wenjuan Yu, Haohui Luo, Jinbao Yang, Shengchen Zhang, Heling Jiang, Xianjia Zhao, Xingqi Hui, Da Sun, Liang Li, Xiu-qing Wei, Stefano Lonardi and Weihua Pan
Genome Research, to appear, 2024. PDF. doi:10.1101/gr.278232.123.

Pacific Biosciences (PacBio) HiFi sequencing technology generates long reads (>10 kbp) with very high accuracy (<0.01% sequencing error). Although several de novo assembly tools are available for HiFi reads, there are no comprehensive studies on the evaluation of these assemblers. We evaluated the performance of 11 de novo HiFi assemblers on (1) real data for three eukaryotic genomes; (2) 34 synthetic data sets with different ploidy, sequencing coverage levels, heterozygosity rates, and sequencing error rates; (3) one real metagenomic data set; and (4) five synthetic metagenomic data sets with different composition abundance and heterozygosity rates. The 11 assemblers were evaluated using quality assessment tool (QUAST) and benchmarking universal single-copy ortholog (BUSCO). We also used several additional criteria, namely, completion rate, single-copy completion rate, duplicated completion rate, average proportion of largest category, average distance difference, quality value, run-time, and memory utilization. Results show that hifiasm and hifiasm-meta should be the first choice for assembling eukaryotic genomes and metagenomes with HiFi data. We performed a comprehensive benchmarking study of commonly used assemblers on complex eukaryotic genomes and metagenomes. Our study will help the research community to choose the most appropriate assembler for their data and identify possible improvements in assembly algorithms.
2023: RAmbler: de novo genome assembly of complex repetitive regions. S. Chakravarty, G. Logsdon, S. Lonardi
ACM BCB 2023 - ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 13, Houston, TX, 2023. Proceedings. PDF. doi:10.1145/3584371.3612971.

Complex repetitive regions (also known as segmental duplications) in eukaryotic genomes often contain essential functional and regulatory information. Despite remarkable algorithmic progress in genome assembly in the last twenty years, modern de novo assemblers still struggle to accurately reconstruct these highly repetitive regions. If sequenced reads were long enough to span all repetitive regions, the problem would be solved trivially. However, even the third generation of sequencing technologies on the market cannot yet produce reads that are sufficiently long (and accurate) to span every repetitive region in large eukaryotic genomes. In this work, we introduce a novel algorithm called RAmbler to resolve complex repetitive regions based on high-quality long reads (i.e., PacBio HiFi). We first identify potentially repetitive regions by mapping the HiFi reads to the draft genome assembly. Regions with sequencing coverage much higher than the average indicate a collapsed repeat in the assembly. Then, (i) we select HiFi reads that map to the repetitive region, including 50 kb upstream and downstream, (ii) we compute the k-mers that are expected to occur only once in the genome (i.e., single copy k-mers, which we call unikmers), (iii) we barcode the HiFi reads based on the presence and the location of their unikmers, (iv) we compute an overlap graph solely based on shared barcodes, (v) we reconstruct the sequence of the repetitive region by traversing the overlap graph. Our statistical analysis of the frequency distribution of k-mers from PacBio HiFi and Oxford Nanopore reads for different choices of k shows that the k-mers occurring in a window of three standard deviations around the mean of the distribution are true unikmers with very high probability, i.e., they occur in the genome only once with high probability. We present an extensive set of experiments comparing the performance of RAmbler against Hifiasm, HiCANU and Verkko on synthetic HiFi reads generated over a wide range of repeat lengths, number of repeats, heterozygosity rates and depth of sequencing (over 140 data sets). Our experimental results indicate that RAmbler outperforms Hifiasm, HiCANU and Verkko on the large majority of the inputs. We also show that RAmbler can resolve several long tandem repeats in Arabidopsis thaliana using real HiFi reads. In order to compare the performance of different assemblers, we introduce a new metric called assembly-score that takes into account completeness, contiguity and accuracy in a single number on a [0, 1] scale. In summary, our contributions are as follows. (i) We carry out the first statistical analysis for the method to extract single-copy k-mers for PacBio HiFi and Oxford Nanopore reads; (ii) we introduce a new assembly method that takes advantage of single-copy k-mers to resolve complex repetitive regions; (iii) we compare our method against state-of-the-art assemblers (Hifiasm, HiCANU and Verkko) on more than 140 synthetic data sets; (iv) we show that our method can resolve complex repetitive regions in A. thaliana using real HiFi data; (v) we introduce a new quality metric for genome assemblies that incorporates completeness, contiguity and the number of misassemblies. While the initial results are very promising, to establish RAmbler's true ability in resolving complex repeats, it will have to be tested more extensively on real data for other organisms. The code for RAmbler is available at https://github.com/sakshar/repeat_assembler.
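Purely as an illustration of the unikmer-selection criterion described in this abstract (not RAmbler's actual code), the following Python sketch keeps only the k-mers whose count falls within three standard deviations of the mean of the k-mer frequency distribution; the mean and standard deviation are assumed to be estimated beforehand from the single-copy peak of a k-mer count histogram (e.g., produced by an external k-mer counter).

```python
# Illustrative sketch only: select "unikmers" (putative single-copy k-mers)
# whose count lies within three standard deviations of the mean of the
# k-mer frequency distribution, as described in the abstract. In practice
# the mean/std would be estimated from the single-copy peak of the count
# histogram produced by a k-mer counter; the values below are assumptions.

def select_unikmers(kmer_counts, mean, std):
    """kmer_counts: dict mapping k-mer -> count; mean/std: parameters of
    the single-copy peak of the k-mer frequency distribution."""
    lo, hi = mean - 3 * std, mean + 3 * std
    return {kmer for kmer, c in kmer_counts.items() if lo <= c <= hi}

# Hypothetical usage on a tiny toy histogram (counts around 30x coverage):
counts = {"ACGTACGTACGTACGTACGTA": 29,   # single-copy -> kept
          "TTTTTTTTTTTTTTTTTTTTT": 310,  # repeat -> discarded
          "ACGTACGTACGTACGTACGTT": 1}    # sequencing error -> discarded
print(select_unikmers(counts, mean=30.0, std=5.0))
```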
2023: Babesia BdFE1 Esterase is Required for the Anti-parasitic Activity of the ACE Inhibitor Fosinopril. P. Vydyam, J.-Y. Choi, S. Gihaz, M. Chand, M. Gewirtz, J. Thekkiniath, S. Lonardi, J. C. Gennaro, C. Ben Mamoun
Journal of Biological Chemistry, to appear, 2023. PDF. doi:10.1016/j.jbc.2023.105313.

Effective and safe therapies for the treatment of diseases caused by intraerythrocytic parasites are impeded by the rapid emergence of drug resistance and the lack of novel drug targets. One such disease is human babesiosis, which is a rapidly emerging tick-borne illness caused by Babesia parasites. In this study, we identified fosinopril, a phosphonate-containing, FDA-approved Angiotensin Converting Enzyme (ACE) inhibitor commonly used as a prodrug for hypertension and heart failure, as a potent inhibitor of B. duncani parasite development within human erythrocytes. Cell biological and mass spectrometry analyses revealed that the conversion of fosinopril to its active diacid molecule, fosinoprilat, is essential for its antiparasitic activity. We show that this conversion is mediated by a parasite-encoded esterase, BdFE1, which is highly conserved among apicomplexan parasites. Parasites carrying the L238H mutation in the active site of BdFE1 failed to convert the prodrug to its active moiety and became resistant to the drug. Our data set the stage for the development of this class of drugs for therapy of vector-borne parasitic diseases.
2023: On the Prediction of non-CG DNA Methylation using Machine Learning. S. Sereshki, N. Lee, M. Omirou, D. Fasoula, S. Lonardi
NAR Genomics and Bioinformatics, 5(2): lqad045, 2023. PDF. doi:10.1093/nargab/lqad045.

DNA methylation can be detected and measured using sequencing instruments after sodium bisulfite conversion, but experiments can be expensive for large eukaryotic genomes. Sequencing nonuniformity and mapping biases can leave parts of the genome with low or no coverage, thus hampering the ability to obtain DNA methylation levels for all cytosines. To address these limitations, several computational methods have been proposed that can predict DNA methylation from the DNA sequence around the cytosine or from the methylation level of nearby cytosines. However, most of these methods are entirely focused on CG methylation in humans and other mammals. In this work, we study, for the first time, the problem of predicting cytosine methylation for CG, CHG and CHH contexts on six plant species, either from the DNA primary sequence around the cytosine or from the methylation levels of neighboring cytosines. In this framework, we also study the cross-species prediction problem and the cross-context prediction problem (within the same species). Finally, we show that providing gene and repeat annotations allows existing classifiers to significantly improve their prediction accuracy. We introduce a new classifier called AMPS (annotation-based methylation prediction from sequence) that takes advantage of genomic annotations to achieve higher accuracy.
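As a hedged, generic illustration of how such a sequence-plus-annotation classifier might encode its input (the actual AMPS features are described in the paper; the window size and flag set below are assumptions), one can one-hot encode a window around each cytosine and append binary gene/repeat annotation flags:

```python
# Hedged sketch: encode a cytosine site as (one-hot sequence window +
# annotation flags) for a downstream classifier. Window size, flags, and
# encoding are illustrative assumptions, not the features used by AMPS.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_site(genome, pos, in_gene, in_repeat, flank=20):
    window = genome[pos - flank: pos + flank + 1]
    one_hot = np.zeros((len(window), 4))
    for i, b in enumerate(window):
        if b in BASES:                      # leave N's as all-zero columns
            one_hot[i, BASES[b]] = 1.0
    flags = np.array([float(in_gene), float(in_repeat)])
    return np.concatenate([one_hot.ravel(), flags])

# Hypothetical usage:
genome = "ACGT" * 30
x = encode_site(genome, pos=60, in_gene=True, in_repeat=False)
print(x.shape)   # (41*4 + 2,) feature vector for one cytosine
```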
2023: Babesia duncani Multi-omics Identifies Virulence Factors and Drug Targets. P. Singh, S. Lonardi, Q. Liang, P. Vydyam, E. Khabirova, T. Fang, S. Gihaz, J. Thekkiniath, M. Munshi, S. Abel, G. Batugedara, M. Gupta, X. M. Lu, T. Lenz, S. Chakravarty, E. Cornillot, Y. Hu, W. Ma, L. M. Gonzalez, S. Sanchez, A. Sanchez-Flores, K. Estrada, O. S. Harb, K. G. Le Roch, C. Ben Mamoun
Nature Microbiology, 8: 845–859, 2023. PDF. doi:10.1038/s41564-023-01360-8.

Babesiosis is a malaria-like disease in humans and animals that is caused by Babesia species, which are tick-transmitted apicomplexan pathogens. Babesia duncani causes severe to lethal infection in humans, but despite the risk that this parasite poses as an emerging pathogen, little is known about its biology, metabolic requirements or pathogenesis. Unlike other apicomplexan parasites that infect red blood cells, B. duncani can be continuously cultured in vitro in human erythrocytes and can infect mice, resulting in fulminant babesiosis and death. We report comprehensive, detailed molecular, genomic, transcriptomic and epigenetic analyses to gain insights into the biology of B. duncani. We completed the assembly, 3D structure and annotation of its nuclear genome, and analysed its transcriptomic and epigenetic profiles during its asexual life cycle stages in human erythrocytes. We used RNA-seq data to produce an atlas of parasite metabolism during its intraerythrocytic life cycle. Characterization of the B. duncani genome, epigenome and transcriptome identified classes of candidate virulence factors, antigens for diagnosis of active infection and several attractive drug targets. Furthermore, metabolic reconstitutions from genome annotation and in vitro efficacy studies identified antifolates, pyrimethamine and WR-99210 as potent inhibitors of B. duncani to establish a pipeline of small molecules that could be developed as effective therapies for the treatment of human babesiosis.
2023: A View of the Pan-genome of Domesticated Cowpea (Vigna unguiculata [L.] Walp). Q. Liang, M. Munoz-Amatriain, S. Shu, S. Lo, X. Wu, J. W. Carlson, P. Davidson, D. M. Goodstein, J. Phillips, N. M. Janis, E. J. Lee, C. Liang, P. L. Morrell, A. D. Farmer, P. Xu, T. J. Close, S. Lonardi
The Plant Genome, e20319, 2023. PDF. doi:10.1002/tpg2.20319.

Cowpea, Vigna unguiculata L. Walp., is a diploid warm-season legume of critical importance as both food and fodder in sub-Saharan Africa. This species is also grown in Northern Africa, Europe, Latin America, North America, and East to Southeast Asia. To capture the genomic diversity of domesticates of this important legume, de novo genome assemblies were produced for representatives of six subpopulations of cultivated cowpea identified previously from genotyping of several hundred diverse accessions. In the most complete assembly (IT97K-499-35), 26,026 core and 4963 noncore genes were identified, with 35,436 pan genes when considering all seven accessions. GO terms associated with response to stress and defense response were highly enriched among the noncore genes, while core genes were enriched in terms related to transcription factor activity, and transport and metabolic processes. Over 5 million single nucleotide polymorphisms (SNPs) relative to each assembly and over 40 structural variants >1 Mb in size were identified by comparing genomes. Vu10 was the chromosome with the highest frequency of SNPs, and Vu04 had the most structural variants. Noncore genes harbor a larger proportion of potentially disruptive variants than core genes, including missense, stop gain, and frameshift mutations; this suggests that noncore genes substantially contribute to diversity within domesticated cowpea.
2022: Karyotype Variation, Spontaneous Genome Rearrangements Affecting Chemical Insensitivity, and Expression Level Polymorphisms in the Plant Pathogen Phytophthora infestans Revealed Using its First Chromosome-scale Assembly. M. E. H. Matson, Q. Liang, S. Lonardi, H. S. Judelson
PLoS Pathogens, 18(10): e1010869, 2022. PDF. doi:10.1371/journal.ppat.1010869.

Natural isolates of the potato and tomato pathogen Phytophthora infestans exhibit substantial variation in virulence, chemical sensitivity, ploidy, and other traits. A chromosome-scale assembly was developed to expand genomic resources for this oomyceteous microbe, and used to explore the basis of variation. Using PacBio and Illumina data, a long-range linking library, and an optical map, an assembly was created and coalesced into 15 pseudochromosomes spanning 219 Mb using SNP-based genetic linkage data. De novo gene prediction combined with transcript evidence identified 19,981 protein-coding genes, plus about eight thousand tRNA genes. The chromosomes were composed of a mosaic of gene-rich and gene-sparse regions plus very long centromeres. Genes exhibited a biased distribution across chromosomes, especially members of families encoding RXLR and CRN effectors, which clustered on certain chromosomes. Strikingly, half of F1 progeny of diploid parents were polyploid or aneuploid. Substantial expression level polymorphisms between strains were identified, much of which could be attributed to differences in chromosome dosage, transposable element insertions, and adjacency to repetitive DNA. QTL analysis identified a locus on the right arm of chromosome 3 governing sensitivity to the crop protection chemical metalaxyl. Strains heterozygous for resistance often experienced megabase-sized deletions of that part of the chromosome when cultured on metalaxyl, increasing resistance due to loss of the sensitive allele. This study sheds light on diverse phenomena affecting variation in P. infestans and relatives, helps explain the prevalence of polyploidy in natural populations, and provides a new foundation for biologic and genetic investigations.
2022: Genome-wide functional screens enable the prediction of high activity CRISPR-Cas9 and -Cas12a guides in Yarrowia lipolytica. D. R. Baisya, A. Ramesh, C. Schwartz, S. Lonardi, I. Wheeldon
Nature Communications, 13: 922, 2022. PDF. doi:10.1038/s41467-022-28540-0.

Genome-wide functional genetic screens have been successful in discovering genotype-phenotype relationships and in engineering new phenotypes. While broadly applied in mammalian cell lines and in E. coli, use in non-conventional microorganisms has been limited, in part, due to the inability to accurately design high activity CRISPR guides in such species. Here, we develop an experimental-computational approach to sgRNA design that is specific to an organism of choice, in this case the oleaginous yeast Yarrowia lipolytica. A negative selection screen in the absence of non-homologous end-joining, the dominant DNA repair mechanism, was used to generate single guide RNA (sgRNA) activity profiles for both SpCas9 and LbCas12a. This genome-wide data served as input to a deep learning algorithm, DeepGuide, that is able to accurately predict guide activity. DeepGuide uses unsupervised learning to obtain a compressed representation of the genome, followed by supervised learning to map sgRNA sequence, genomic context, and epigenetic features to guide activity. Experimental validation, both genome-wide and with a subset of selected genes, confirms DeepGuide's ability to accurately predict high activity sgRNAs. DeepGuide provides an organism-specific predictor of CRISPR guide activity that, with retraining, could be applied to other fungal species, prokaryotes, and other non-conventional organisms.
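The following is a loose, hedged sketch of the two-stage idea summarized above (unsupervised compression followed by supervised prediction); it is an assumed toy architecture in PyTorch, not DeepGuide itself: a small convolutional autoencoder is pre-trained on one-hot sequence windows, and its encoder then feeds a regression head that predicts guide activity.

```python
# Illustrative two-stage sketch (assumed architecture, not DeepGuide itself):
# (1) a convolutional autoencoder compresses one-hot genomic windows;
# (2) the trained encoder feeds a small regression head that predicts
#     guide activity from the guide + context sequence.
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, length=60):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 8, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(8, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 4, kernel_size=5, padding=2),
        )

    def forward(self, x):                 # x: (batch, 4, length)
        return self.decoder(self.encoder(x))

class ActivityRegressor(nn.Module):
    def __init__(self, encoder, length=60):
        super().__init__()
        self.encoder = encoder            # pre-trained, unsupervised stage
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(8 * length, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)

# Hypothetical usage on a batch of one-hot encoded 60-nt windows:
ae = SeqAutoencoder()
model = ActivityRegressor(ae.encoder)
x = torch.zeros(2, 4, 60)
print(model(x).shape)                     # torch.Size([2])
```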
2021: Reference‐agnostic Representation and Visualization of Pan‐genomes. Q. Liang, S. Lonardi
BMC Bioinformatics, 22: 502, 2021. PDF. doi:10.1186/s12859-021-04424-w.

Background: The pan-genome of a species is the union of the genes and non-coding sequences present in all individuals (cultivars, accessions, or strains) within that species. Results: Here we introduce PGV, a reference-agnostic representation of the pan-genome of a species based on the notion of consensus ordering. Our experimental results demonstrate that PGV enables an intuitive, effective and interactive visualization of a pan-genome by providing a genome browser that can elucidate complex structural genomic variations. Conclusions: The PGV software can be installed via conda or downloaded from https://github.com/ucrbioinfo/PGV. The companion PGV browser at http://pgv.cs.ucr.edu can be tested using example bed tracks available from the GitHub page.
2020: Prediction of Histone Post-translational Modifications using Deep Learning. D. R. Baisya, S. Lonardi
Bioinformatics, 36(24): 5610–5617, 2020. PDF. doi:10.1093/bioinformatics/btaa1075.

Motivation: Histone post-translational modifications (PTMs) are involved in a variety of essential regulatory processes in the cell, including transcription control. Recent studies have shown that histone PTMs can be accurately predicted from the knowledge of transcription factor binding or DNase hypersensitivity data. Similarly, it has been shown that one can predict PTMs from the underlying DNA primary sequence. Results: In this study, we introduce a deep learning architecture called DeepPTM for predicting histone PTMs from transcription factor binding data and the primary DNA sequence. Extensive experimental results show that our deep learning model improves on the prediction accuracy of the models proposed in Benveniste et al. (PNAS 2014) and DeepHistone (BMC Genomics 2019). The competitive advantage of our framework lies in the synergistic use of deep learning combined with an effective pre-processing step. Our classification framework has also enabled the discovery that the knowledge of a small subset of transcription factors (which are histone-PTM and cell-type-specific) can provide almost the same prediction accuracy that can be obtained using the data for all transcription factors. Available from GitHub.
2020: Mustache: Multi-scale Detection of Chromatin Loops from Hi-C and Micro-C Maps using Scale-space Representation. A. R. Ardakany, H. T. Gezer, S. Lonardi, F. Ay
Genome Biology, 21: 256, 2020. PDF. doi:10.1186/s13059-020-02167-0.

We present Mustache, a new method for multi-scale detection of chromatin loops from Hi-C and Micro-C contact maps. Mustache employs scale-space theory, a technical advance in computer vision, to detect blob-shaped objects in contact maps. Mustache is scalable to kilobase-resolution maps and reports loops that are highly consistent between replicates and between Hi-C and Micro-C datasets. Compared to other loop callers, such as HiCCUPS and SIP, Mustache recovers a higher number of published ChIA-PET and HiChIP loops as well as loops linking promoters to regulatory elements. Overall, Mustache enables an efficient and comprehensive analysis of chromatin loops. Available from Github.
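For readers unfamiliar with scale-space blob detection, the hedged sketch below shows the generic technique (scale-normalized Laplacian-of-Gaussian responses at several scales, keeping local maxima across space and scale); it is not Mustache's implementation, which adds contact-map-specific normalization and statistical filtering.

```python
# Generic scale-space blob detection sketch (not Mustache itself): compute
# negated, scale-normalized Laplacian-of-Gaussian responses of a contact map
# at several scales and keep pixels that are local maxima across space and
# scale. All parameters below are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_laplace, maximum_filter

def detect_blobs(contact_map, sigmas=(1.0, 2.0, 4.0), threshold=0.1):
    # Bright blobs give a strong negative LoG, so negate and scale by sigma^2.
    stack = np.stack([-(s ** 2) * gaussian_laplace(contact_map.astype(float), s)
                      for s in sigmas])
    peaks = (stack == maximum_filter(stack, size=3)) & (stack > threshold)
    scale_idx, rows, cols = np.nonzero(peaks)
    return [(r, c, sigmas[s]) for s, r, c in zip(scale_idx, rows, cols)]

# Hypothetical usage: a toy 100x100 "contact map" with one bright blob.
toy = np.zeros((100, 100))
toy[48:53, 48:53] = 1.0
print(detect_blobs(toy)[:3])   # (row, col, scale) triples near (50, 50)
```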
2020: DeeplyEssential: a Deep Neural Network for Predicting Essential Genes in Microbes. Md A. Hasan, S. Lonardi
BMC Bioinformatics, 21:367, 2020. PDF. doi:10.1186/s12859-020-03688-y.

Background: Essential genes are those genes that are critical for the survival of an organism. The prediction of essential genes in bacteria can provide targets for the design of novel antibiotic compounds or antimicrobial strategies. Results: We propose a deep neural network for predicting essential genes in microbes. Our architecture, called DeeplyEssential, makes minimal assumptions about the input data (i.e., it only uses gene primary sequence and the corresponding protein sequence) to carry out the prediction, thus maximizing its practical application compared to existing predictors that require structural or topological features, which might not be readily available. We also expose and study a hidden performance bias that affected previous classifiers. Extensive results show that DeeplyEssential outperforms existing classifiers that either employ down-sampling to balance the training set or use clustering to exclude multiple copies of orthologous genes. Conclusion: Deep neural network architectures can efficiently predict whether a microbial gene is essential (or not) using only its sequence information. Available at GitHub.
2020: OMGS: Optical Map-based Genome Scaffolding. W. Pan, T. Jiang, S. Lonardi
Journal of Computational Biology, 27(4): 519-533, 2020. PDF. doi:10.1089/cmb.2019.0310.

Due to the current limitations of sequencing technologies, de novo genome assembly is typically carried out in two stages, namely contig (sequence) assembly and scaffolding. While scaffolding is computationally easier than sequence assembly, the scaffolding problem can be challenging due to the high repetitive content of eukaryotic genomes, possible mis-joins in assembled contigs and inaccuracies in the linkage information. Genome scaffolding tools either use paired-end/mate-pair/linked/Hi-C reads or genome-wide maps (optical, physical or genetic) as linkage information. Optical maps (in particular Bionano Genomics maps) have been extensively used in many recent large-scale genome assembly projects (e.g., goat, apple, barley, maize, quinoa, sea bass, among others). However, the most commonly used scaffolding tools have a serious limitation: they can only deal with one optical map at a time, forcing users to alternate or iterate over multiple maps. In this paper, we introduce a novel scaffolding algorithm called OMGS that for the first time can take advantage of multiple optical maps. OMGS solves several optimization problems to generate scaffolds with optimal contiguity and correctness. Extensive experimental results demonstrate that our tool outperforms existing methods when multiple optical maps are available, and produces comparable scaffolds using a single optical map. OMGS can be obtained from GitHub.
2019: Seed Coat Pattern QTL and Development in Cowpea (Vigna unguiculata [L.] Walp.). I. A. Herniter, R. Lo, M. Munoz-Amatriain, S. Lo, Y.-N. Guo, B.-L. Huynh, M. Lucas, Z. Jia, P. A. Roberts, S. Lonardi, T. J. Close
Frontiers in Plant Science, 10: 1346, 2019. PDF. doi:10.3389/fpls.2019.01346.

The appearance of the seed is an important aspect of consumer preference for cowpea (Vigna unguiculata [L.] Walp.). Seed coat pattern in cowpea has been a subject of study for over a century. This study makes use of newly available resources, including mapping populations, a reference genome and additional genome assemblies, and a high-density single nucleotide polymorphism genotyping platform, to map various seed coat pattern traits to three loci, concurrent with the Color Factor (C), Watson (W), and Holstein (H) factors identified previously. Several gene models encoding proteins involved in regulating the later stages of the flavonoid biosynthesis pathway have been identified as candidate genes, including a basic helix-loop-helix gene (Vigun07g110700) for the C locus, a WD-repeat gene (Vigun09g139900) for the W locus and an E3 ubiquitin ligase gene (Vigun10g163900) for the H locus. A model of seed coat development, consisting of six distinct stages, is described to explain some of the observed pattern phenotypes.
2019: The Genome of Cowpea (Vigna unguiculata [L.] Walp.). S. Lonardi, M. Munoz-Amatriain, Q. Liang, S. Shu, S. I. Wanamaker, S. Lo, J. Tanskanen, A. H. Schulman, T. Zhu, M.-C. Luo, H. Alhakami, R. Ounit, A. Md Hasan, J. Verdier, P. A Roberts, J. RP Santos, A. Ndeve, J. Dolezel, J. Vrana, S. A. Hokin, A. D. Farmer, S. B. Cannon, T. J. Close
The Plant Journal, 98(5): 767-782, 2019. doi:10.1111/tpj.14349. PDF. Supplemental material. Lyza Maron's highlight 'Progress for the pea of prosperity'. Cover feature.

Cowpea (Vigna unguiculata [L.] Walp.) is a major crop for worldwide food and nutritional security, especially in sub-Saharan Africa, that is resilient to hot and drought-prone environments. An assembly of the single-haplotype inbred genome of cowpea IT97K-499-35 was developed by exploiting the synergies between single molecule real-time sequencing, optical and genetic mapping, and an assembly reconciliation algorithm. A total of 519 Mb is included in the assembled sequences. Nearly half of the assembled sequence is composed of repetitive elements, which are enriched within recombination-poor pericentromeric regions. A comparative analysis of these elements suggests that genome size differences between Vigna species are mainly attributable to changes in the amount of Gypsy retrotransposons. Conversely, genes are more abundant in more distal, high-recombination regions of the chromosomes; there appears to be more duplication of genes within the NBS-LRR and the SAUR-like auxin superfamilies compared to other warm-season legumes that have been sequenced. A surprising outcome is the identification of an inversion of 4.2 Mb among landraces and cultivars, which includes a gene that has been associated in other plants with interactions with the parasitic weed Striga gesnerioides. The genome sequence facilitated the identification of a putative syntelog for multiple organ gigantism in legumes. A revised numbering system has been adopted for cowpea chromosomes based on synteny with common bean (Phaseolus vulgaris). An estimate of nuclear genome size of 640.6 Mbp based on cytometry is presented.
2019: Validating Genome-wide CRISPR-Cas9 Function Improves Screening in the Oleaginous Yeast Yarrowia lipolytica. C. Schwartz, J.-F. Cheng, R. Evans, C. A. Schwartz, J. M. Wagner, S. Anglin, A. Beitz, W. Pan, S. Lonardi, M. Blenner, H. S. Alper, Y. Yoshikuni, I. Wheeldon
Metabolic Engineering, 55: pp.102-110, 2019. PDF. doi:10.1016/j.ymben.2019.06.007.

Genome-wide mutational screens are central to understanding the genetic underpinnings of evolved and engineered phenotypes. The widespread adoption of CRISPR-Cas9 genome editing has enabled such screens in many organisms, but identifying functional sgRNAs still remains a challenge. Here, we developed a methodology to quantify the cutting efficiency of each sgRNA in a genome-scale library, and in doing so improve screens in the biotechnologically important yeast Yarrowia lipolytica. Screening in the presence and absence of native DNA repair enabled high-throughput quantification of sgRNA function leading to the identification of high efficiency sgRNAs that cover 94% of genes. Library validation enhanced the classification of essential genes by identifying inactive guides that create false negatives and mask the effects of successful disruptions. Quantification of guide effectiveness also creates a dataset from which determinants of CRISPR-Cas9 can be identified. Finally, application of the library identified novel mutations for metabolic engineering of high lipid accumulation.
2019: Selfish: Discovery of Differential Chromatin Interactions via a Self-Similarity Measure. A. Roayaei Ardakany, F. Ay, S. Lonardi
ISMB 2019 - Conference on Intelligent Systems for Molecular Biology, 2019. Special Issue of Bioinformatics, 35(14): pp.i145–i153, 2019. doi:10.1093/bioinformatics/btz362. PDF.

Motivation: High-throughput conformation capture experiments such as Hi-C provide genome-wide maps of chromatin interactions, enabling life scientists to investigate the role of the three-dimensional structure of genomes in gene regulation and other essential cellular functions. A fundamental problem in the analysis of Hi-C data is how to compare two contact maps derived from Hi-C experiments. Detecting similarities and differences between contact maps is critical in evaluating the reproducibility of replicate experiments and identifying differential genomic regions with biological significance. Due to the complexity of chromatin conformations and the presence of technology-driven and sequence-specific biases, the comparative analysis of Hi-C data is analytically and computationally challenging.
Results: We present a novel method called SELFISH for the comparative analysis of Hi-C data that takes advantage of the structural self-similarity in contact maps. We define a novel self-similarity measure to design algorithms for (i) measuring reproducibility for Hi-C replicate experiments and (ii) finding differential chromatin interactions between two contact maps. Extensive experimental results on simulated and real data show that SELFISH is more accurate and robust than state-of-the-art methods.
Availability: GIT
2019: OMGS: Optical Map-based Genome Scaffolding. W. Pan, T. Jiang, S. Lonardi
RECOMB 2019 - ACM Annual Conference on Research in Computational Molecular Biology, pp.190-207, Washington, DC, 2019. Proceedings. PDF. doi:10.1007/978-3-030-17083-7_12.

Conference version of the JCB 2020 paper above.
2019: Accurate Detection of Chimeric Contigs via Bionano Optical Maps. W. Pan, S. Lonardi
Bioinformatics, 35(10): 1760-1762, 2019. PDF. Supplemental. doi:10.1093/bioinformatics/bty850.

A chimeric contig is a contig that has been incorrectly assembled, i.e., a contig that contains one or more mis-joins. The detection of chimeric contigs can be carried out either by aligning assembled contigs to genome-wide maps (e.g., genetic, physical or optical maps) or by mapping sequenced reads to the assembled contigs. Here we introduce a software tool called Chimericognizer that takes advantage of one or more Bionano Genomics optical maps to accurately detect and correct chimeric contigs. Experimental results show that Chimericognizer is very accurate, and significantly better than the chimeric detection method offered by the Bionano Hybrid Scaffold pipeline. Chimericognizer can also detect and correct chimeric optical molecules.
2018: mClass: Cancer Type Classification with Somatic Point Mutation Data. Md Abid Hasan, Stefano Lonardi
RECOMB-CG 2018 - International Conference on Comparative Genomics, 131-145, Magog-Orford, Canada, 2018. Proceedings. PDF. doi:10.1007/978-3-030-00834-5_7.

Cancer is a complex disease associated with abnormal DNA mutations. Not all tumors are cancerous and not all cancers are the same. Correct cancer type diagnosis can indicate the most effective drug therapy and increase survival rate. At the molecular level, it has been shown that cancer type classification can be carried out from the analysis of somatic point mutations. However, the high dimensionality and sparsity of genomic mutation data, coupled with small sample sizes, has been a hindrance to the accurate classification of cancer. We address these problems by introducing a novel classification method called mClass that accounts for the sparsity of the data. mClass is a feature selection method that ranks genes based on their similarity across samples and employs their normalized mutual information to determine the set of genes that provide optimal classification accuracy. Experimental results on TCGA datasets show that mClass significantly improves testing accuracy compared to DeepGene, which is the state-of-the-art in cancer-type classification based on somatic mutation data. In addition, when compared with other cancer gene prediction tools, the set of genes selected by mClass contains the highest number of genes among the top 100 genes listed in the Cancer Gene Census. mClass is available at https://github.com/mdahasan/mClass.
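As a simplified, hedged illustration of information-based gene ranking on sparse somatic mutation data (not the mClass algorithm itself, which additionally accounts for gene similarity across samples), the sketch below ranks binary mutation features by their mutual information with the cancer-type label:

```python
# Hedged sketch: rank genes by the mutual information between their binary
# mutation status and the cancer-type label. This illustrates the general
# idea of information-based feature selection, not mClass itself.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_genes(X, y, gene_names):
    """X: samples x genes binary mutation matrix; y: cancer-type labels."""
    mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    order = np.argsort(mi)[::-1]
    return [(gene_names[i], mi[i]) for i in order]

# Hypothetical toy data: gene g1 is informative, g2 and g3 are noise.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]] * 10)
y = np.array([0, 0, 1, 1] * 10)
print(rank_genes(X, y, ["g1", "g2", "g3"])[0][0])   # likely "g1"
```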
2018: Novo&Stitch: Accurate Reconciliation of Genome Assemblies via Optical Maps. W. Pan, S. I. Wanamaker, A.M.V. Ah-Fong, H. Judelson, S. Lonardi
ISMB 2018 - Conference on Intelligent Systems for Molecular Biology, July 6-10, Chicago, 2018. Special Issue of Bioinformatics, 34(13): i43–i51, 2018. PDF. doi:10.1093/bioinformatics/bty255

Motivation: De novo genome assembly is a challenging computational problem due to the high repetitive content of eukaryotic genomes and the imperfections of sequencing technologies (i.e., sequencing errors, uneven sequencing coverage, and chimeric reads). Several assembly tools are currently available, each of which has strengths and weaknesses in dealing with the trade-off between maximizing contiguity and minimizing assembly errors (e.g., mis-joins). To obtain the best possible assembly, it is common practice to generate multiple assemblies from several assemblers and/or parameter settings and try to identify the highest quality assembly. Unfortunately, often there is no assembly that both maximizes contiguity and minimizes assembly errors, so one has to compromise one for the other.
Results: The concept of assembly reconciliation has been proposed as a way to obtain a higher quality assembly by merging or reconciling all the available assemblies. While several reconciliation methods have been introduced in the literature, we have shown in our recent paper in Genome Biology that none of them can consistently produce assemblies that are better than the assemblies provided as input. Here we introduce Novo&Stitch, a novel method that takes advantage of optical maps to accurately carry out assembly reconciliation (assuming that the assembled contigs are sufficiently long to be reliably aligned to the optical maps, e.g., 50 kbp or longer). Experimental results demonstrate that Novo&Stitch can double the contiguity (N50) of the input assemblies without introducing mis-joins or reducing genome completeness.
Availability: Novo&Stitch can be obtained from https://github.com/ucrbioinfo/Novo_Stitch
2018: A multi-parent advanced generation inter-cross (MAGIC) population for genetic analysis and improvement of cowpea (Vigna unguiculata L. Walp.). B.-L. Huynh, J. D. Ehlers, B. E. Huang, M. Munoz-Amatriain, S. Lonardi, J. R. P. Santos, A. Ndeve, B. J. Batieno, O. Boukar, N. Cisse, I. Drabo, C. Fatokun, F. Kusi, R. Y. Agyare, Y.-N. Guo, I. Herniter, S. Lo, S. I. Wanamaker, S. Xu, T. J. Close, P. A. Roberts
The Plant Journal, 93(6): 1129-1142, 2018. PDF. doi:10.1111/tpj.13827.

Multi-parent advanced generation inter-cross (MAGIC) populations are an emerging type of resource for dissecting the genetic structure of traits and improving breeding populations. We developed a MAGIC population for cowpea (Vigna unguiculata L. Walp.) from eight founder parents. These founders were genetically diverse and carried many abiotic and biotic stress resistance, seed quality and agronomic traits relevant to cowpea improvement in the United States and sub-Saharan Africa, where cowpea is vitally important in the human diet and local economies. The eight parents were inter-crossed using structured matings to ensure that the population would have balanced representation from each parent, followed by single-seed descent, resulting in 305 F8 recombinant inbred lines each carrying a mosaic of genome blocks contributed by all founders. This was confirmed by single nucleotide polymorphism genotyping with the Illumina Cowpea Consortium Array. These lines were on average 99.74% homozygous but also diverse in agronomic traits across environments. Quantitative trait loci (QTLs) were identified for several parental traits. Loci with major effects on photoperiod sensitivity and seed size were also verified by biparental genetic mapping. The recombination events were concentrated in telomeric regions. Due to its broad genetic base, this cowpea MAGIC population promises breakthroughs in genetic gain, QTL and gene discovery, enhancement of breeding populations and, for some lines, direct releases as new varieties.
2017: Comprehensive Benchmarking and Ensemble Approaches for Metagenomic Classifiers. A. B. R. McIntyre, R. Ounit, E. Afshinnekoo, R. J. Prill, E. Henaff, N. Alexander, S. S. Minot, D. Danko, J. Foox, S. Ahsanuddin, S. Tighe, N. A. Hasan, P. Subramanian, K. Moffat, S. Levy, S. Lonardi, N. Greenfield, R. R. Colwell, G. L. Rosen, C. E. Mason
Genome Biology, 18: 182, 2017. PDF. doi:10.1186/s13059-017-1299-7.

Background: One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited. Results: In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages. Conclusions: This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
2017: ThIEF: Finding Genome-wide Trajectories of Epigenetics Marks. A. Polishko, Md. A. Hasan, W. Pan, E. M. Bunnik, K. Le Roch, S. Lonardi
WABI 2017 - Workshop on Algorithms in Bioinformatics, 19:1-19:16, Boston, MA, 2017. Proceedings. PDF. doi:10.4230/LIPIcs.WABI.2017.19.

We address the problem of comparing multiple genome-wide maps representing nucleosome positions or specific histone marks. These maps can originate from the comparative analysis of ChIP-Seq/MNase-Seq/FAIRE-Seq data for different cell types/tissues or multiple time points. The input to the problem is a set of maps, each of which is a list of genomic locations for nucleosomes or histone marks. The output is an alignment of nucleosomes/histone marks across time points (that we call trajectories), allowing small movements and gaps in some of the maps. We present a tool called ThIEF (TrackIng of Epigenetic Features) that can efficiently compute these trajectories. ThIEF comes in two 'flavors': ThIEF:Iterative finds the trajectories progressively using bipartite matching, while ThIEF:LP solves a k-partite matching problem on a hypergraph using linear programming. ThIEF:LP is guaranteed to find the optimal solution, but it is slower than ThIEF:Iterative. We demonstrate the utility of ThIEF by providing an example application to the analysis of temporal nucleosome maps for the human malaria parasite. As a remarkable result, we show that the output of ThIEF can be used to produce a supervised classifier that can accurately predict the position of stable nucleosomes (i.e., nucleosomes present in all time points) and unstable nucleosomes (i.e., present in at most half of the time points) from the primary DNA sequence. To the best of our knowledge, this is the first result on the prediction of the dynamics of nucleosomes solely based on their DNA binding preference. Software is available at https://github.com/ucrbioinfo/ThIEF
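A minimal sketch of the progressive bipartite-matching idea behind ThIEF:Iterative is shown below; it assumes 1D nucleosome center coordinates, an absolute-distance cost and a maximum-shift threshold (all illustrative choices), whereas the real tool also lets trajectories resume after gaps and handles richer features.

```python
# Minimal sketch of progressive trajectory building by bipartite matching
# (in the spirit of ThIEF:Iterative; cost and parameters are illustrative).
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_maps(maps, max_shift=20):
    """maps: list of 1D lists of nucleosome center positions, one per time
    point. Returns trajectories as lists of positions (None = gap)."""
    trajectories = [[p] for p in maps[0]]
    for nxt in maps[1:]:
        prev = np.array([t[-1] if t[-1] is not None else np.inf
                         for t in trajectories])
        cost = np.abs(prev[:, None] - np.array(nxt)[None, :])
        rows, cols = linear_sum_assignment(np.minimum(cost, max_shift + 1))
        matched = {r: nxt[c] for r, c in zip(rows, cols)
                   if cost[r, c] <= max_shift}
        used = set(matched.values())
        for i, t in enumerate(trajectories):
            t.append(matched.get(i))              # None if no match (gap)
        for p in nxt:                             # unmatched -> new trajectory
            if p not in used:
                trajectories.append([None] * (len(trajectories[0]) - 1) + [p])
    return trajectories

# Hypothetical usage with three time points:
print(link_maps([[100, 540, 900], [105, 560, 905], [110, 910]]))
# -> [[100, 105, 110], [540, 560, None], [900, 905, 910]]
```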
2017: Efficient and Accurate Detection of Topologically Associating Domains from Contact Maps. A. R. Ardakany and S. Lonardi
WABI 2017 - Workshop on Algorithms in Bioinformatics, 22:1-22:11, Boston, MA, 2017. Proceedings. PDF. doi:10.4230/LIPIcs.WABI.2017.22.

Continuous improvements to high-throughput conformation capture (Hi-C) are revealing richer information about the spatial organization of the chromatin and its role in cellular functions. Several studies have confirmed the existence of structural features of the genome 3D organization that are stable across cell types and conserved across species, called topologically associating domains (TADs). The detection of TADs has become a critical step in the analysis of Hi-C data, e.g., to identify enhancer-promoter associations. Here we present EAST, a novel TAD identification algorithm based on fast 2D convolution of Haar-like features, that is as accurate as the state-of-the-art method based on the directionality index, but 75-80x faster. EAST is available in the public domain at https://github.com/ucrbioinfo/EAST.
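To make "fast 2D convolution of Haar-like features" concrete, here is a generic sketch (not EAST's code) that uses an integral image so that any rectangular sum, and hence any Haar-like box-difference feature on a contact map, is evaluated in constant time.

```python
# Generic sketch of Haar-like feature evaluation with an integral image
# (constant-time rectangle sums); illustrative, not EAST's implementation.
import numpy as np

def integral_image(m):
    return m.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r1, c1, r2, c2):
    """Sum of m[r1:r2, c1:c2] using the integral image ii of m."""
    total = ii[r2 - 1, c2 - 1]
    if r1 > 0:
        total -= ii[r1 - 1, c2 - 1]
    if c1 > 0:
        total -= ii[r2 - 1, c1 - 1]
    if r1 > 0 and c1 > 0:
        total += ii[r1 - 1, c1 - 1]
    return total

def haar_diagonal_feature(ii, start, size):
    """Contrast between a putative TAD square on the diagonal and the two
    flanking off-diagonal rectangles (a Haar-like box-difference)."""
    s, w = start, size
    inside = rect_sum(ii, s, s, s + w, s + w)
    flank = rect_sum(ii, s, s + w, s + w, min(s + 2 * w, ii.shape[1])) + \
            rect_sum(ii, s + w, s, min(s + 2 * w, ii.shape[0]), s + w)
    return inside - flank

# Hypothetical usage on a toy contact map with a dense 20x20 block:
cm = np.ones((60, 60)) * 0.1
cm[10:30, 10:30] += 1.0
ii = integral_image(cm)
print(haar_diagonal_feature(ii, start=10, size=20))   # strongly positive
```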
2017: A Comparative Evaluation of Genome Assembly Reconciliation Tools. H. Alhakami, H. Mirebrahim, Stefano Lonardi
Genome Biology, 18: 93, 2017. PDF. doi:10.1186/s13059-017-1213-3.

Background: The majority of eukaryotic genomes are unfinished due to the algorithmic challenges of assembling them. A variety of assembly and scaffolding tools are available, but it is not always obvious which tool or parameters to use for a specific genome size and complexity. It is, therefore, common practice to produce multiple assemblies using different assemblers and parameters, then select the best one for public release. A more compelling approach would allow one to merge multiple assemblies with the intent of producing a higher quality consensus assembly, which is the objective of assembly reconciliation. Results: Several assembly reconciliation tools have been proposed in the literature, but their strengths and weaknesses have never been compared on a common dataset. We fill this need with this work, in which we report on an extensive comparative evaluation of several tools. Specifically, we evaluate contiguity, correctness, coverage, and the duplication ratio of the merged assembly compared to the individual assemblies provided as input. Conclusions: None of the tools we tested consistently improved the quality of the input GAGE and synthetic assemblies. Our experiments show an increase in contiguity in the consensus assembly when the original assemblies already have high quality. In terms of correctness, the quality of the results depends on the specific tool, as well as on the quality and the ranking of the input assemblies. In general, the number of misassemblies ranges from being comparable to the best of the input assemblies to being comparable to the worst of the input assemblies.
2017: Construction of a Map-based Reference Genome Sequence for Barley, Hordeum vulgare L. S. Beier, A. Himmelbach, C. Colmsee, X.-Q. Zhang, R. A. Barrero, Q. Zhang, L. Li, M. Bayer, D. Bolser, S. Taudien, M. Groth, M. Felder, A. Hastie, H. Simkova, H. Stankova, J. Vrana, S. Chan, M. Munoz-Amatriain, R. Ounit, S. Wanamaker, T. Schmutzer, L. Aliyeva-Schnorr, S. Grasso, J. Tanskanen, D. Sampath, D. Heavens, S. Cao, B. Chapman, F. Dai, Y. Han, H. Li, X. Li, C. Lin, J. K. McCooke, C. Tan, S. Wang, S. Yin, G. Zhou, J. A. Poland, M. I. Bellgard, A. Houben, J. Dolezel, S. Ayling, S. Lonardi, P. Langridge, G. J. Muehlbauer, P. Kersey, M. D. Clark, M. Caccamo, A. H. Schulman, M. Platzer, T. J. Close, M. Hansson, G. Zhang, I. Braumann, C. Li, R. Waugh, U. Scholz, N. Stein, M. Mascher
Scientific Data (Nature), 4: 170044, 2017. PDF. doi:10.1038/sdata.2017.44.

Barley (Hordeum vulgare L.) is a cereal grass mainly used as animal fodder and raw material for the malting industry. The map-based reference genome sequence of barley cv. Morex was constructed by the International Barley Genome Sequencing Consortium (IBSC) using hierarchical shotgun sequencing. Here, we report the experimental and computational procedures to (i) sequence and assemble more than 80,000 bacterial artificial chromosome (BAC) clones along the minimum tiling path of a genome-wide physical map, (ii) find and validate overlaps between adjacent BACs, (iii) construct 4,265 non-redundant sequence scaffolds representing clusters of overlapping BACs, and (iv) order and orient these BAC clusters along the seven barley chromosomes using positional information provided by dense genetic maps, an optical map and chromosome conformation capture sequencing (Hi-C). Integrative access to these sequence and mapping resources is provided by the barley genome explorer (BARLEX).
2017: A Chromosome Conformation Capture Ordered Sequence of the Barley Genome. M. Mascher, H. Gundlach, A. Himmelbach, S. Beier, S. O. Twardziok, T. Wicker, V. Radchuk, C. Dockter, P. E. Hedley, J. Russell, M. Bayer, L. Ramsay, H. Liu, G. Haberer, X.-Q. Zhang, Q. Zhang, R. A. Barrero, L. Li, S. Taudien, M. Groth, M. Felder, A. Hastie, H. Simkova, H. Stankova, J. Vrana, S. Chan, M. Munoz-Amatriain, R. Ounit, S. Wanamaker, D. Bolser, C. Colmsee, T. Schmutzer, L. Aliyeva-Schnorr, S. Grasso, J. Tanskanen, A. Chailyan, D. Sampath, D. Heavens, L. Clissold, S. Cao, B. Chapman, F. Dai, Y. Han, H. Li, X. Li, C. Lin, J. K. McCooke, C. Tan, P. Wang, S. Wang, S. Yin, G. Zhou, J. A. Poland, M. I. Bellgard, L. Borisjuk, A. Houben, J. Dolezel, S. Ayling, S. Lonardi, P. Kersey, P. Langridge, G. J. Muehlbauer, M. D. Clark, M. Caccamo, A. H. Schulman, K.F.X. Mayer, M. Platzer, T. J. Close, U. Scholz, M. Hansson, G. Zhang, I. Braumann, M. Spannagl, C. Li, R. Waugh, N. Stein
Nature, 544: 427-433, 2017. PDF. doi:10.1038/nature22043.

Cereal grasses of the Triticeae tribe have been the major food source in temperate regions since the dawn of agriculture. Their large genomes are characterized by a high content of repetitive elements and large pericentromeric regions that are virtually devoid of meiotic recombination. Here we present a high-quality reference genome assembly for barley (Hordeum vulgare L.). We use chromosome conformation capture mapping to derive the linear order of sequences across the pericentromeric space and to investigate the spatial organization of chromatin in the nucleus at megabase resolution. The composition of genes and repetitive elements differs between distal and proximal regions. Gene family analyses reveal lineage-specific duplications of genes involved in the transport of nutrients to developing seeds and the mobilization of carbohydrates in grains. We demonstrate the importance of the barley reference sequence for breeding by inspecting the genomic partitioning of sequence variation in modern elite germplasm, highlighting regions vulnerable to genetic erosion.
2016: Genome Resources for Climate-resilient Cowpea, an Essential Crop for Food Security. M. Munoz-Amatriain, H. Mirebrahim, P. Xu, S. I. Wanamaker, M.C. Luo, H. Alhakami, M. Alpert, I. Atokple, B. J. Batieno, O. Boukar, S. Bozdag, N. Cisse, I. Drabo, J. D. Ehlers, A. Farmer, C. Fatokun, Y. Q. Gu, Y.-N. Guo, B.-L. Huynh, S. A. Jackson, F. Kusi, C. T. Lawley, M. R. Lucas, Y. Ma, M. P. Timko, J. Wu, F. You, P. A. Roberts, S. Lonardi, T. J. Close
The Plant Journal, 89(5): 1042-1054, 2016. PDF. doi:10.1111/tpj.13404.

Cowpea (Vigna unguiculata L. Walp.) is a legume crop that is resilient to hot and drought-prone climates, and a primary source of protein in sub-Saharan Africa and other parts of the developing world. However, genome resources for cowpea have lagged behind most other major crops. Here we describe foundational genome resources and their application to the analysis of germplasm currently in use in West African breeding programs. Resources developed from the African cultivar IT97K-499-35 include a whole-genome shotgun (WGS) assembly, a bacterial artificial chromosome (BAC) physical map, and assembled sequences from 4355 BACs. These resources and WGS sequences of an additional 36 diverse cowpea accessions supported the development of a genotyping assay for 51 128 SNPs, which was then applied to five bi-parental RIL populations to produce a consensus genetic map containing 37 372 SNPs. This genetic map enabled the anchoring of 100 Mb of WGS and 420 Mb of BAC sequences, an exploration of genetic diversity along each linkage group, and clarification of macrosynteny between cowpea and common bean. The SNP assay enabled a diversity analysis of materials from West African breeding programs. Two major subpopulations exist within those materials, one of which has significant parentage from South and East Africa and more diversity. There are genomic regions of high differentiation between subpopulations, one of which coincides with a cluster of nodulin genes. The new resources and knowledge help to define goals and accelerate the breeding of improved varieties to address food security issues related to limited-input small-holder farming and climate stress.
2016: rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison. L. Hahn, C.-A. Leimeister, R. Ounit, S. Lonardi, B. Morgenstern
PLoS Computational Biology, 12(10): e1005107, 2016. PDF. doi:10.1371/journal.pcbi.1005107.

Summary: Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/
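As a small, hedged illustration of the spaced-words idea that rasbhari optimizes over (a binary pattern with match and don't-care positions), the sketch below counts the pattern-based matches between two sequences; it does not implement rasbhari's hill-climbing search over pattern sets.

```python
# Illustrative sketch of spaced-word matching with a binary pattern
# (1 = match position, 0 = don't care); not rasbhari's optimization code,
# which searches for good *sets* of such patterns by hill climbing.
def spaced_words(seq, pattern):
    match_pos = [i for i, ch in enumerate(pattern) if ch == "1"]
    words = {}
    for i in range(len(seq) - len(pattern) + 1):
        w = "".join(seq[i + j] for j in match_pos)
        words.setdefault(w, []).append(i)
    return words

def spaced_matches(s1, s2, pattern):
    """Number of position pairs (i, j) where s1 and s2 agree on all match
    positions of the pattern (the statistic whose variance rasbhari
    minimizes over a pattern set)."""
    w1, w2 = spaced_words(s1, pattern), spaced_words(s2, pattern)
    return sum(len(p1) * len(w2[w]) for w, p1 in w1.items() if w in w2)

# Hypothetical usage with one weight-4 pattern of length 6:
print(spaced_matches("ACGTACGTAC", "ACGAACGTTC", "110101"))
```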
2016: Higher Classification Sensitivity of Short Metagenomic Reads with CLARK-S. R. Ounit, S. Lonardi
Bioinformatics, 32(24): 3823-3825, 2016. PDF. Supplemental. doi:10.1093/bioinformatics/btw542.

Summary: The growing number of metagenomic studies in medicine and environmental sciences is creating increasing demands on the computational infrastructure designed to analyze these very large datasets. Often, the construction of ultra-fast and precise taxonomic classifiers can compromise on their sensitivity (i.e. the number of reads correctly classified). Here we introduce CLARK-S, a new software tool that can classify short reads with high precision, high sensitivity and high speed. Availability and Implementation: CLARK-S is freely available at http://clark.cs.ucr.edu
2016: Brat-Nova: Fast and Accurate Mapping of Bisulfite-treated Reads. E. Y. Harris, R. Ounit, S. Lonardi
Bioinformatics, 32(17): 2696-2698, 2016. PDF. Supplemental. doi:10.1093/bioinformatics/btw226.

Summary: In response to increasing amounts of sequencing data, faster and faster aligners need to become available. Here, we introduce BRAT-nova, a completely rewritten and improved implementation of the mapping tool BRAT-BW for bisulfite-treated reads (BS-Seq). BRAT-nova is very fast and accurate. On the human genome, BRAT-nova is 2-7 times faster than state-of-the-art aligners, while maintaining the same percentage of uniquely mapped reads and space usage. On synthetic reads, BRAT-nova is 2-8 times faster than state-of-the-art aligners while maintaining similar mapping accuracy, methylation call accuracy, methylation level accuracy and space efficiency. Availability and implementation: The software is available in the public domain at http://compbio.cs.ucr.edu/brat/
2015: Analysis of Nucleosome Positioning Landscapes Enables Gene Discovery in the Human Malaria Parasite Plasmodium falciparum. X. M. Lu, E. M. Bunnik, N. Pokhriyal, S. Nasseri, S. Lonardi, K. G. Le Roch
BMC Genomics, 16: 1005, 2015. PDF. doi:10.1186/s12864-015-2214-9.

Background: Plasmodium falciparum, the deadliest malaria-causing parasite, has an extremely AT-rich (80.7 %) genome. Because of high AT-content, sequence-based annotation of genes and functional elements remains challenging. In order to better understand the regulatory network controlling gene expression in the parasite, a more complete genome annotation as well as analysis tools adapted for AT-rich genomes are needed. Recent studies on genome-wide nucleosome positioning in eukaryotes have shown that nucleosome landscapes exhibit regular characteristic patterns at the 5'- and 3'-end of protein and non-protein coding genes. In addition, nucleosome depleted regions can be found near transcription start sites. These unique nucleosome landscape patterns may be exploited for the identification of novel genes. In this paper, we propose a computational approach to discover novel putative genes based exclusively on nucleosome positioning data in the AT-rich genome of P. falciparum. Results: Using binary classifiers trained on nucleosome landscapes at the gene boundaries from two independent nucleosome positioning data sets, we were able to detect a total of 231 regions containing putative genes in the genome of Plasmodium falciparum, of which 67 highly confident genes were found in both data sets. Eighty-eight of these 231 newly predicted genes exhibited transcription signal in RNA-Seq data, indicative of active transcription. In addition, 20 out of 21 selected gene candidates were further validated by RT-PCR, and 28 out of the 231 genes showed significant matches using BLASTN against an expressed sequence tag (EST) database. Furthermore, 108 (47 %) out of the 231 putative novel genes overlapped with previously identified but unannotated long non-coding RNAs. Collectively, these results provide experimental validation for 163 predicted genes (70.6 %). Finally, 73 out of 231 genes were found to be potentially translated based on their signal in polysome-associated RNA-Seq representing transcripts that are actively being translated. Conclusion: Our results clearly indicate that nucleosome positioning data contains sufficient information for novel gene discovery. As distinct nucleosome landscapes around genes are found in many other eukaryotic organisms, this methodology could be used to characterize the transcriptome of any organism, especially when coupled with other DNA-based gene finding and experimental methods (e.g., RNA-Seq).
2015: Higher Classification Accuracy of Short Metagenomic Reads by Discriminative Spaced k-mers. R. Ounit, S. Lonardi
WABI 2015 - Workshop on Algorithms in Bioinformatics, pp.286-295, LNBI 9289, Atlanta, GA, 2015. Proceedings. PDF. doi:10.1007/978-3-662-48221-6_21.

The growing number of metagenomic studies in medicine and environmental sciences is creating new computational demands in the analysis of these very large datasets. We have recently proposed a time-efficient algorithm called Clark that can accurately classify metagenomic sequences against a set of reference genomes. The competitive advantage of Clark depends on the use of discriminative contiguous k-mers. In default mode, Clark's speed is currently unmatched and its precision is comparable to the state of the art; however, its sensitivity still does not match the level of the most sensitive (but slowest) metagenomic classifier. In this paper, we introduce an algorithmic improvement that allows Clark's classification sensitivity to match the best metagenomic classifier, without a significant loss of speed or precision compared to the original version. Finally, on real metagenomes, Clark can assign with high accuracy a much higher proportion of short reads than its closest competitor. The improved version of Clark, based on discriminative spaced k-mers, is freely available at http://clark.cs.ucr.edu/Spaced/.
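As a toy, hedged illustration of classification with discriminative k-mers, the core idea behind Clark (CLARK-S replaces contiguous k-mers with spaced k-mers), the sketch below indexes only the k-mers unique to a single reference and assigns each read to the reference collecting the most hits; all parameters and data structures are illustrative assumptions.

```python
# Toy sketch of discriminative k-mer classification in the spirit of Clark
# (contiguous k-mers shown; CLARK-S uses spaced k-mers instead). Purely
# illustrative: k, the index layout, and the voting rule are assumptions.
from collections import Counter

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k=5):
    """Keep only k-mers that occur in exactly one reference (discriminative)."""
    owner = {}
    for name, seq in references.items():
        for km in kmers(seq, k):
            owner[km] = None if km in owner else name
    return {km: name for km, name in owner.items() if name is not None}

def classify(read, index, k=5):
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    return hits.most_common(1)[0][0] if hits else "unclassified"

# Hypothetical usage with two tiny "genomes":
refs = {"taxonA": "ACGTACGTGGCCATATCGCG", "taxonB": "TTGACCTGAGGTCAAATTCG"}
idx = build_index(refs)
print(classify("ACGTACGTGGCC", idx))   # expected: taxonA
```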
2015: Scrible: Ultra-Accurate Error-Correction of Pooled Sequenced Reads. D. Duma, F. Cordero, M. Beccuti, G. Ciardo, T. J. Close, S. Lonardi
WABI 2015 - Workshop on Algorithms in Bioinformatics, pp. 162-174, LNBI 9289, Atlanta, GA, 2015. Proceedings. PDF. doi:10.1007/978-3-662-48221-6_12.

We recently proposed a novel clone-by-clone protocol for de novo genome sequencing that leverages combinatorial pooling design to overcome the limitations of DNA barcoding when multiplexing a large number of samples on second-generation sequencing instruments. Here we address the problem of correcting the short reads obtained from our sequencing protocol. We introduce a novel algorithm called Scrible that exploits properties of the pooling design to accurately identify/correct sequencing errors and minimize the chance of over-correcting. Experimental results on synthetic data on the rice genome demonstrate that our method has much higher accuracy in correcting short reads compared to state-of-the-art error-correcting methods. On real data on the barley genome we show that Scrible significantly improves the decoding accuracy of short reads to individual BACs.
2015: Sequencing of 15,622 Gene-bearing BACs Clarifies the Gene-dense Regions of the Barley Genome. M. Munoz-Amatriain*, S. Lonardi*, M-C. Luo, K. Madishetty, J. T. Svensson, M. J. Moscou, S. Wanamaker, T. Jiang, A. Kleinhofs, G. J. Muehlbauer, R. P. Wise, N. Stein, Y. Ma, E. Rodriguez, D. Kudrna, P. R. Bhat, S. Chao, P. Condamine, S. Heinen, J. Resnik, R. Wing, H. N Witt, M. Alpert, M. Beccuti, S. Bozdag, F. Cordero, H. Mirebrahim, R. Ounit, Y. Wu, F. You, J. Zheng, H. Simkova, J. Dolezel, J. Grimwood, J. Schmutz, D. Duma, L. Altschmied, T. Blake, P. Bregitzer, L. Cooper, M. Dilbirligi, A. Falk, L. Feiz, A. Graner, P. Gustafson, P. M. Hayes, P. Lemaux, J. Mammadov, T. J. Close (* equal contributors)
The Plant Journal, 84(1): 216-227, 2015. PDF. doi:10.1111/tpj.12959.

Barley (Hordeum vulgare L.) possesses a large and highly repetitive genome of 5.1 Gb that has hindered the development of a complete sequence. In 2012, the International Barley Sequencing Consortium released a resource integrating whole-genome shotgun sequences with a physical and genetic framework. However, because only 6,278 BACs in the physical map were sequenced, fine structure was limited. To gain access to the gene-containing portion of the barley genome at high resolution, we identified and sequenced 15,622 BACs representing the minimal tiling path of 72,052 physically mapped gene-bearing BACs. This generated ~1.7 Gb of genomic sequence containing an estimated 2/3 of all Morex barley genes. Exploration of these sequenced BACs revealed that although distal ends of chromosomes contain most of the gene-enriched BACs and are characterized by high recombination rates, there are also gene-dense regions with suppressed recombination. We made use of published map-anchored sequence data from Aegilops tauschii to develop a synteny viewer between barley and the ancestor of the wheat D genome. Except for some notable inversions, there is a high level of collinearity between the two species. The software HarvEST:Barley provides facile access to BAC sequences and their annotations, along with the barley-Ae. tauschii synteny viewer. These BAC sequences constitute a resource to improve the efficiency of marker development, map-based cloning, and comparative genomics in barley and related crops. Additional knowledge about the gene-dense but low-recombination regions of the barley genome is particularly relevant.
2015: When Less is More: "Slicing" Sequencing Data Improves Read Decoding Accuracy and De Novo Assembly Quality. S. Lonardi, H. Mirebrahim, S. Wanamaker, M. Alpert, G. Ciardo, D. Duma, T. J. Close
Bioinformatics, 31(18): 2972-2980, 2015. PDF. Supplemental. doi:10.1093/bioinformatics/btv311.

Motivation: Since the invention of DNA sequencing in the seventies, computational biologists have had to deal with the problem of de novo genome assembly with limited (or insufficient) depth of sequencing. In this work, we investigate the opposite problem, that is, the challenge of dealing with excessive depth of sequencing.
Results: We explore the effect of ultra-deep sequencing data in two domains: (i) the problem of decoding reads to BAC clones (in the context of the combinatorial pooling design proposed in (Lonardi et al., 2013)), and (ii) the problem of de novo assembly of BAC clones. Using real ultra-deep sequencing data, we show that when the depth of sequencing increases over a certain threshold, sequencing errors make these two problems harder and harder (instead of easier, as one would expect with error-free data), and as a consequence the quality of the solution degrades with more and more data. For the first problem, we propose an effective solution based on "divide and conquer": we "slice" a large dataset into smaller samples of optimal size, decode each slice independently, then merge the results. Experimental results on over 15,000 barley BACs and over 4,000 cowpea BACs demonstrate a significant improvement in the quality of the decoding and the final assembly. For the second problem, we show for the first time that modern de novo assemblers cannot take advantage of ultra-deep sequencing data.
Availability: Python scripts to process slices and resolve decoding conflicts are available from http://goo.gl/YXgdHT; software HASHFILTER can be downloaded from http://goo.gl/MIyZHs
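As a rough sketch of the "slicing" idea described above (the target coverage, region length, and the per-slice and merge helpers below are illustrative assumptions, not the published scripts):

```python
# Sketch of the divide-and-conquer "slicing" idea: split an ultra-deep read
# set into slices whose depth is close to a target coverage, so each slice
# can be decoded (or assembled) independently and the results merged later.

def make_slices(reads, region_length, target_coverage):
    """Partition `reads` (list of strings) into slices of roughly target_coverage."""
    bases_per_slice = target_coverage * region_length
    slices, current, acc = [], [], 0
    for read in reads:
        current.append(read)
        acc += len(read)
        if acc >= bases_per_slice:
            slices.append(current)
            current, acc = [], 0
    if current:
        slices.append(current)
    return slices

# Example: ~100x slices over a 10 kb region (hypothetical downstream steps):
# slices = make_slices(all_reads, region_length=10_000, target_coverage=100)
# results = [decode_or_assemble(s) for s in slices]   # hypothetical per-slice step
# final = merge(results)                              # hypothetical merge step
```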
2015: De Novo Meta-Assembly of Ultra-deep Sequencing Data. H. Mirebrahim, T. J. Close, S. Lonardi
ISMB/ECCB 2015 - Conference on Intelligent Systems for Molecular Biology and European Conference on Computational Biology, July 10-14, Dublin, 2015. Special Issue of Bioinformatics, 31(12): i9-i16, 2015. PDF. doi:10.1093/bioinformatics/btv226

We introduce a new divide and conquer approach to deal with the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e., coverage of 1,000x or higher). Our proposed meta-assembler SLICEMBLER partitions the input data into optimal-sized "slices" and uses a standard assembly tool (e.g., Velvet, SPAdes, IDBA, Ray) to assemble each slice individually. SLICEMBLER uses majority voting among the individual assemblies to identify long contigs that can be merged into the consensus assembly. To improve its efficiency, SLICEMBLER uses a generalized suffix tree to identify these frequent contigs (or fractions thereof). Extensive experimental results on real ultra-deep sequencing data (8,000x coverage) and simulated data show that SLICEMBLER significantly improves the quality of the assembly compared to the performance of the base assembler. In fact, most of the time SLICEMBLER generates error-free assemblies. We also show that SLICEMBLER is much more resistant to high sequencing error rates than the base assembler. SLICEMBLER can be accessed at http://slicembler.cs.ucr.edu/
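A toy sketch of the majority-voting step, with plain substring search standing in for the generalized suffix tree used by SLICEMBLER (the slices, candidate contig, and threshold below are made up):

```python
# Toy majority voting among per-slice assemblies: a candidate sequence is
# accepted into the consensus if it appears as an exact substring of at least
# `min_votes` individual assemblies. SLICEMBLER performs this step with a
# generalized suffix tree; plain substring search keeps the example short.

def majority_contigs(assemblies, candidates, min_votes):
    """`assemblies`: one list of contigs per slice; `candidates`: sequences to test."""
    accepted = []
    for cand in candidates:
        votes = sum(
            any(cand in contig for contig in assembly)
            for assembly in assemblies
        )
        if votes >= min_votes:
            accepted.append(cand)
    return accepted

slices = [["ACGTTGCA", "GGGTTT"], ["ACGTTGCAAA"], ["TTACGTTGCATT"]]
print(majority_contigs(slices, candidates=["ACGTTGCA"], min_votes=2))   # -> ['ACGTTGCA']
```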
2015: FHAST: FPGA-based Acceleration of Bowtie in Hardware. E. B. Fernandez, J. Villarreal, S. Lonardi, W. A. Najjar
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(5): 973-981, 2015. PDF. doi:10.1109/TCBB.2015.2405333. BibTex entry.

While the sequencing capability of modern instruments continues to increase exponentially, the computational problem of mapping short sequenced reads to a reference genome still constitutes a bottleneck in the analysis pipeline. A variety of mapping tools (e.g., BOWTIE, BWA) is available for general-purpose computer architectures. These tools can take many hours or even days to deliver mapping results, depending on the number of input reads, the size of the reference genome and the number of allowed mismatches or insertions/deletions, making the mapping problem an ideal candidate for hardware acceleration. In this paper, we present FHAST (FPGA Hardware Accelerated Sequence-matching Tool), a drop-in replacement for BOWTIE that uses a hardware design based on Field Programmable Gate Arrays (FPGA). Our architecture masks memory latency by executing multiple concurrent hardware threads accessing memory simultaneously. FHAST is composed of multiple parallel engines that exploit the parallelism available on an FPGA. We implemented and tested FHAST on the Convey HC-1 and later ported it to the Convey HC-2ex, taking advantage of the large memory bandwidth available on these systems and the shared memory image between hardware and software. A preliminary version of FHAST running on the Convey HC-1 achieved up to a 70x speedup compared to BOWTIE (single-threaded). An improved version of FHAST running on the Convey HC-2ex FPGAs achieved up to a 12x speedup compared to BOWTIE running eight threads on an 8-core conventional architecture, while maintaining almost identical mapping accuracy. FHAST is a drop-in replacement for BOWTIE, so it can be incorporated in any analysis pipeline that uses BOWTIE (e.g., TOPHAT).
2015: CLARK: Fast and Accurate Classification of Metagenomic and Genomic Sequences using Discriminative K-mers. R. Ounit, S. Wanamaker, T. J. Close, S. Lonardi
BMC Genomics, 16:236, 2015. PDF. doi:10.1186/s12864-015-1419-2. BibTex entry.

Background: The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce. Results: We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to that of the best state-of-the-art tools, and that it is significantly faster than any of its competitors. In its fastest single-threaded mode CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions. Conclusions: CLARK is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/.
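A minimal sketch of the discriminative k-mer idea (toy k, made-up sequences, not the tool's implementation): keep only k-mers that occur in exactly one reference, then assign each read to the reference with which it shares the most of them:

```python
from collections import Counter

K = 5  # illustrative k; the tool uses much longer k-mers

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references):
    """Keep only k-mers that occur in exactly one target (discriminative)."""
    owner = {}
    for name, seq in references.items():
        for km in kmers(seq):
            owner[km] = None if km in owner else name   # None = shared, discard
    index = {}
    for km, name in owner.items():
        if name is not None:
            index.setdefault(name, set()).add(km)
    return index

def classify(read, index):
    """Assign the read to the target sharing the most discriminative k-mers."""
    hits = Counter()
    for name, kms in index.items():
        hits[name] = len(kmers(read) & kms)
    name, best = hits.most_common(1)[0]
    return name if best > 0 else "unclassified"

refs = {"genomeA": "ACGTACGTTTGACCA", "genomeB": "ACGTACGTGGCATTA"}
idx = build_index(refs)
print(classify("TTGACCAACG", idx))   # -> genomeA
```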
2014: Generating and Reversing Chronic Wounds in Diabetic Mice by Manipulating Wound Redox Parameters. S. Dhall, D. C. Do, M. Garcia, J. Kim, H. Mirebrahim, J. Lyubovitsky, S. Lonardi, E. A. Nothnagel, N. L. Schiller, M. Martins-Green
Journal of Diabetes Research, 2014(562625): 18 pages, 2014. PDF. doi:10.1155/2014/562625. BibTex entry.

By 2025, more than 500M people worldwide will suffer from diabetes; 125M will develop foot ulcer(s) and 20M will undergo an amputation, creating a major health problem. Understanding how these wounds become chronic will provide insights to reverse chronicity. We hypothesized that oxidative stress (OS) in wounds is a critical component for generation of chronicity. We used the db/db mouse model of impaired healing and inhibited, at time of injury, two major antioxidant enzymes, catalase and glutathione peroxidase, creating high OS in the wounds. This was necessary and sufficient to trigger wounds to become chronic. The wounds initially contained a polymicrobial community that with time selected for specific biofilm-forming bacteria. To reverse chronicity we treated the wounds with the antioxidants alpha-tocopherol and N-acetylcysteine and found that OS was highly reduced, biofilms had increased sensitivity to antibiotics, and granulation tissue was formed with proper collagen deposition and remodeling. We show for the first time the generation of chronic wounds in which biofilm develops spontaneously, illustrating the importance of early and continued redox imbalance, coupled with the presence of biofilm, in the development of wound chronicity. This model will help decipher additional mechanisms and potentially improve the diagnosis of chronicity and the treatment of human chronic wounds.
2014: PuFFIN: A Parameter-free Method to Build Nucleosome Maps from Paired-end Reads. A. Polishko, E. Bunnik, K. Le Roch, S. Lonardi
BMC Bioinformatics, 15(Suppl 9): S11, 2014. doi:10.1186/1471-2105-15-S9-S11. Presented at RECOMB Workshop on Massively Parallel Sequencing (RECOMB-SEQ'14), Pittsburgh, PA, USA. PDF. BibTex entry.

Background: We introduce a novel method, called PuFFIN, that takes advantage of paired-end short reads to build genome-wide nucleosome maps with larger numbers of detected nucleosomes and higher accuracy than existing tools. In contrast to other approaches that require users to optimize several parameters according to their data (e.g., the maximum allowed nucleosome overlap or legal ranges for the fragment sizes) our algorithm can accurately determine a genome-wide set of non-overlapping nucleosomes without any user-defined parameter. This feature makes PuFFIN significantly easier to use and prevents users from choosing the "wrong" parameters and obtaining sub-optimal nucleosome maps. Results: PuFFIN builds genome-wide nucleosome maps using a multi-scale (or multi-resolution) approach. Our algorithm relies on a set of nucleosome "landscape" functions at different resolution levels: each function represents the likelihood of each genomic location to be occupied by a nucleosome for a particular value of the smoothing parameter. After a set of candidate nucleosomes is computed for each function, PuFFIN produces a consensus set that satisfies non-overlapping constraints and maximizes the number of nucleosomes. Conclusions: We report comprehensive experimental results that compare PuFFIN with recently published tools (NOrMAL, Template Filtering, and NucPosSimulator) on several synthetic datasets as well as real data for S. cerevisiae and P. falciparum. Experimental results show that our approach produces more accurate nucleosome maps with a higher number of non-overlapping nucleosomes than other tools.
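A rough sketch of the multi-resolution idea (the smoothing bandwidths, peak calling, and greedy consensus below are toy assumptions; PuFFIN's actual landscape functions and consensus procedure differ):

```python
import numpy as np

def landscape(centers, genome_len, sigma):
    """Gaussian-smoothed occupancy curve from fragment-center positions."""
    curve = np.zeros(genome_len)
    xs = np.arange(genome_len)
    for c in centers:
        curve += np.exp(-0.5 * ((xs - c) / sigma) ** 2)
    return curve

def call_nucleosomes(centers, genome_len, sigmas=(20, 40, 80), width=147):
    """Collect peak candidates at several resolutions, then greedily keep a
    maximal set of non-overlapping calls (147 bp ~ nucleosomal footprint)."""
    candidates = []
    for sigma in sigmas:
        curve = landscape(centers, genome_len, sigma)
        for pos in range(1, genome_len - 1):
            if curve[pos] > curve[pos - 1] and curve[pos] >= curve[pos + 1]:
                candidates.append((curve[pos], pos))
    chosen = []
    for score, pos in sorted(candidates, reverse=True):
        if all(abs(pos - p) >= width for _, p in chosen):
            chosen.append((score, pos))
    return sorted(p for _, p in chosen)

# two clusters of synthetic fragment centers -> two non-overlapping calls
print(call_nucleosomes(centers=[100, 105, 110, 300, 310], genome_len=500))
```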
2014: Deciphering Histone Code of Transcriptional Regulation in Malaria Parasites by Large-scale Data Mining. H. Chen, S. Lonardi, J. Zheng
Computational Biology and Chemistry, 50: 3-10, 2014. Presented at Advances in Bioinformatics: Twelfth Asia Pacific Bioinformatics Conference (APBC2014). PDF. doi:10.1016/j.compbiolchem.2014.01.002. BibTex entry.

Histone modifications play a major role in the regulation of gene expression. Accumulated evidence has shown that histone modifications act cooperatively to mediate biological processes such as transcription. This has led to the "histone code" hypothesis, which suggests that combinations of different histone modifications correspond to unique chromatin states and have distinct functions. In this paper, we propose a framework based on association rule mining to discover the potential regulatory relations between histone modifications and gene expression in Plasmodium falciparum. Our approach can output rules with statistical significance. Some of the discovered rules are supported by experimental results in the literature. Moreover, we have also discovered de novo rules which can guide further research in epigenetic regulation of transcription. Based on our association rules we build a model to predict gene expression, which outperforms a published Bayesian network model for gene expression prediction by histone modifications. The results of our study reveal mechanisms by which histone modifications regulate transcription on a large scale. Among our findings, the cooperation among histone modifications provides new evidence for the histone code hypothesis. Furthermore, the rules output by our method can be used to predict changes in gene expression.
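As a back-of-the-envelope illustration of how such rules are scored (the marks, genes and values below are invented), support and confidence of a rule over binarized per-gene profiles:

```python
# Scoring an association rule of the form {H3K9ac, H3K4me3} -> high_expression
# on binarized per-gene profiles (1 = mark present / gene highly expressed).
genes = {
    "gene1": {"H3K9ac": 1, "H3K4me3": 1, "high_expression": 1},
    "gene2": {"H3K9ac": 1, "H3K4me3": 0, "high_expression": 0},
    "gene3": {"H3K9ac": 1, "H3K4me3": 1, "high_expression": 1},
    "gene4": {"H3K9ac": 0, "H3K4me3": 1, "high_expression": 0},
}

def rule_stats(genes, antecedent, consequent):
    both = sum(all(g[a] for a in antecedent) and g[consequent] for g in genes.values())
    ante = sum(all(g[a] for a in antecedent) for g in genes.values())
    support = both / len(genes)
    confidence = both / ante if ante else 0.0
    return support, confidence

print(rule_stats(genes, antecedent=("H3K9ac", "H3K4me3"), consequent="high_expression"))
# -> (0.5, 1.0): the rule covers 2 of 4 genes and holds for every gene matching the antecedent
```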
2014: DNA-encoded nucleosome occupancy is associated with transcription levels in the human malaria parasite Plasmodium falciparum. E. M. Bunnik, A. Polishko, J. Prudhomme, N. Ponts, S. S. Gill, S. Lonardi, K. G. Le Roch
BMC Genomics, 15:347, 2014. PDF. doi:10.1186/1471-2164-15-347.

Background: In eukaryotic organisms, packaging of DNA into nucleosomes controls gene expression by regulating access of the promoter to transcription factors. The human malaria parasite Plasmodium falciparum encodes relatively few transcription factors, while extensive nucleosome remodeling occurs during its replicative cycle in red blood cells. These observations point towards an important role of the nucleosome landscape in regulating gene expression. However, the relation between nucleosome positioning and transcriptional activity has thus far not been explored in detail in the parasite. Results: Here, we analyzed nucleosome positioning in the asexual and sexual stages of the parasite's erythrocytic cycle using chromatin immunoprecipitation of MNase-digested chromatin, followed by next-generation sequencing. We observed a relatively open chromatin structure at the trophozoite and gametocyte stages, consistent with high levels of transcriptional activity in these stages. Nucleosome occupancy of genes and promoter regions were subsequently compared to steady-state mRNA expression levels. Transcript abundance showed a strong inverse correlation with nucleosome occupancy levels in promoter regions. In addition, AT-repeat sequences were strongly unfavorable for nucleosome binding in P. falciparum, and were overrepresented in promoters of highly expressed genes. Conclusions: The connection between chromatin structure and gene expression in P. falciparum shares similarities with other eukaryotes. However, the remarkable nucleosome dynamics during the erythrocytic stages and the absence of a large variety of transcription factors may indicate that nucleosome binding and remodeling are critical regulators of transcript levels. Moreover, the strong dependency between chromatin structure and DNA sequence suggests that the P. falciparum genome may have been shaped by nucleosome binding preferences. Nucleosome remodeling mechanisms in this deadly parasite could thus provide potent novel anti-malarial targets.
2014: Identification of candidate genes and molecular markers for heat-induced brown discoloration of seed coats in cowpea [Vigna unguiculata (L.) Walp]. M. Pottorff, P. A. Roberts, T. J. Close, S. Lonardi, S. Wanamaker, J. D. Ehlers
BMC Genomics, 15:328, 2014. PDF. doi:10.1186/1471-2164-15-328. BibTex entry.

Background. Heat-induced browning (Hbs) of seed coats is caused by high temperatures, which discolor the seed coats of many legumes, affecting the visual appearance and quality of seeds. The genetic determinants underlying Hbs in cowpea are unknown. Results. We identified three QTL associated with the heat-induced browning of seed coats trait, Hbs-1, Hbs-2 and Hbs-3, using cowpea RIL populations IT93K-503-1 (Hbs positive) x CB46 (hbs negative) and IT84S-2246 (Hbs positive) x TVu14676 (hbs negative). Hbs-1 was identified in both populations, accounting for 28.3%-77.3% of the phenotypic variation. SNP markers 1_0032 and 1_1128 co-segregated with the trait. Within the syntenic regions of Hbs-1 in soybean, Medicago and common bean, several ethylene forming enzymes, ethylene responsive element binding factors and an ACC oxidase 2 were observed. Hbs-1 was identified in a BAC clone in contig 217 of the cowpea physical map, where ethylene forming enzymes were present. Hbs-2 was identified in the IT93K-503-1 x CB46 population and accounted for 9.5% to 12.3% of the phenotypic variance. Hbs-3 was identified in the IT84S-2246 x TVu14676 population and accounted for 6.2% to 6.8% of the phenotypic variance. SNP marker 1_0640 co-segregated with the heat-induced browning phenotype. Hbs-3 was positioned on BAC clones in contig 512 of the cowpea physical map, where several ACC synthase 1 genes were present. Conclusion. The identification of loci determining heat-induced browning of seed coats and co-segregating molecular markers will enable transfer of hbs alleles into cowpea varieties, contributing to higher quality seeds.
2013: Genome-scale discovery of DNA methylations in the human malaria parasite Plasmodium falciparum. N. Ponts, L. Fu, E. Y. Harris, J. Zhang, D.-W. D. Chung, E. Bunnik, M. C. Cervantes, J. Prudhomme, V. Atanasova-Penichon, E. Zehraoui, E. M. Rodrigues, S. Lonardi, G. R. Hicks, Y. Wang, K. G. Le Roch
Cell Host and Microbe, 14(6): 696-706, 2013. PDF. doi:10.1016/j.chom.2013.11.007. Faculty of 1000 review.

Cytosine DNA methylation is an epigenetic mark in most eukaryotic cells that regulates numerous processes, including gene expression and stress responses. We performed a genome-wide analysis of DNA methylation in the human malaria parasite Plasmodium falciparum. We mapped the positions of methylated cytosines and identified a single functional DNA methyltransferase (Plasmodium falciparum DNA methyltransferase; PfDNMT) that may mediate these genomic modifications. These analyses revealed that the malaria genome is asymmetrically methylated and shares common features with undifferentiated plant and mammalian cells. Notably, core promoters are hypomethylated, and transcript levels correlate with intraexonic methylation. Additionally, there are sharp methylation transitions at nucleosome and exon-intron boundaries. These data suggest that DNA methylation could regulate virulence gene expression and transcription elongation. Furthermore, the broad range of action of DNA methylation and the uniqueness of PfDNMT suggest that the methylation pathway is a potential target for antimalarial strategies.
2013: Accurate Decoding of Pooled Sequenced Data Using Compressed Sensing. D. Duma, M. Wootters, A. C. Gilbert, H. Q. Ngo, A. Rudra, M. Alpert, T. J. Close, G. Ciardo and S. Lonardi
WABI 2013 - Workshop on Algorithms in Bioinformatics, pp.70-84, LNBI 8126, Sophia Antipolis, France, 2013. PDF doi:10.1007/978-3-642-40453-5_7.

In order to overcome the limitations imposed by DNA barcoding when multiplexing a large number of samples in the current generation of high-throughput sequencing instruments, we have recently proposed a new protocol that leverages advances in combinatorial pooling design (group testing). We have also demonstrated how this new protocol would enable de novo selective sequencing and assembly of large, highly-repetitive genomes. Here we address the problem of decoding pooled sequenced data obtained from such a protocol. Our algorithm employs a synergistic combination of ideas from compressed sensing and the decoding of error-correcting codes. Experimental results on synthetic data for the rice genome and real data for the barley genome show that our novel decoding algorithm enables significantly higher quality assemblies than the previous approach.
2013: Mechanisms of small RNA generation from cis-NATs in response to environmental and developmental cues. X. Zhang, Y. Lii, Z. Wu, A. Polishko, H. Zhang, V. Chinnusamy, S. Lonardi, J.-K. Zhu, R. Liu and H. Jin
Molecular Plant, 6(3): 704-715, 2013. PDF. doi:10.1093/mp/sst051.

A large proportion of eukaryotic genomes is transcribed from both the positive and negative strands of the DNA and thus may generate overlapping sense and antisense transcripts. Some of these so-called natural antisense transcripts are possibly co-regulated. When the overlapping sense and antisense transcripts are expressed at the same time in the same cell in response to various developmental and environmental cues, they may form double-stranded RNAs, which could be recognized by the small RNA biogenesis machinery and processed into small interfering RNAs (siRNAs). cis-NATs-derived siRNAs (nat-siRNAs) are present in plants, animals and fungi. In plants, the presence of nat-siRNAs is supported not only by Northern blot and genetic analyses, but also by the fact that there is an overall six-fold enrichment of siRNAs in the overlapping regions of cis-NATs and 19-29% of the siRNA-generating cis-NATs in plants give rise to siRNAs only in their overlapping regions. Silencing mediated by nat-siRNAs is one of the mechanisms for regulating the expression of the cis-NATs. This review focuses on challenging issues related to the biogenesis mechanisms as well as regulation and detection of nat-siRNAs. The advantages and limitations of new technologies for detecting cis-NATs, including direct RNA sequencing and strand-specific RNA-sequencing, are also discussed.
2013: A Graph-Theoretical Approach to the Selection of the Minimum Tiling Path from a Physical Map. S. Bozdag, T. J. Close and S. Lonardi
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(2): 352-360, 2013. PDF. doi:10.1109/TCBB.2013.26.

The problem of computing the minimum tiling path (MTP) from a set of clones arranged in a physical map is a cornerstone of hierarchical (clone-by-clone) genome sequencing projects. We formulate this problem in a graph theoretical framework, and then solve it by a combination of minimum hitting set and minimum spanning tree algorithms. The tool implementing this strategy, called FMTP, shows improved performance compared to the widely used software FPC. When we execute FMTP and FPC on the same physical map, the MTP produced by FMTP covers a higher portion of the genome and uses a smaller number of clones. For instance, on the rice genome the MTP produced by our tool would reduce by about 11% the cost of a clone-by-clone sequencing project. Source code, benchmark datasets, and documentation of FMTP are freely available at http://code.google.com/p/fingerprint-based-minimal-tiling-path/ under the MIT license.
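For intuition only, a toy greedy sketch of the tiling-path concept on clones modeled as intervals along a contig (FMTP's actual formulation uses a minimum hitting set plus a minimum spanning tree on the clone overlap graph; the clone coordinates below are made up):

```python
# Toy minimum-tiling-path intuition: clones are intervals on a contig; pick a
# small subset covering the whole span by always extending with the clone
# that reaches farthest. This illustrates the concept, not FMTP's algorithm.

def greedy_tiling_path(clones, contig_start, contig_end):
    """`clones`: dict name -> (start, end). Returns a covering subset of names."""
    path, pos = [], contig_start
    while pos < contig_end:
        # among clones that start at or before `pos`, pick the one reaching farthest
        reachable = {n: e for n, (s, e) in clones.items() if s <= pos < e}
        if not reachable:
            raise ValueError(f"coverage gap at position {pos}")
        best = max(reachable, key=reachable.get)
        path.append(best)
        pos = reachable[best]
    return path

clones = {"c1": (0, 120), "c2": (80, 260), "c3": (100, 200), "c4": (240, 400)}
print(greedy_tiling_path(clones, 0, 400))   # -> ['c1', 'c2', 'c4']
```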
2013: Combinatorial Pooling Enables Selective Sequencing of the Barley Gene Space. S. Lonardi, D. Duma, M. Alpert, F. Cordero, M. Beccuti, P. R. Bhat, Y. Wu, G. Ciardo, B. Alsaihati, Y. Ma, S. Wanamaker, J. Resnik, S. Bozdag, M-C. Luo, T. J. Close
PLoS Computational Biology, 9(4): e1003010, 2013. PDF. doi:10.1371/journal.pcbi.1003010.

For the vast majority of species, including many economically or ecologically important organisms, progress in biological research is hampered by the lack of a reference genome sequence. Despite recent advances in sequencing technologies, several factors still limit the availability of such a critical resource. At the same time, many research groups and international consortia have already produced BAC libraries and physical maps and now are in a position to proceed with the development of whole-genome sequences organized around a physical map anchored to a genetic map. We propose a BAC-by-BAC sequencing protocol that combines combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when preparing sequencing libraries for hundreds or thousands of DNA samples, such as in this case gene-bearing minimum-tiling-path BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones (deconvolution) so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is very accurate, and the resulting BAC assemblies have high quality. Results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate and the BAC assemblies have good quality. While our method cannot provide the level of completeness that one would achieve with a comprehensive whole-genome sequencing project, we show that it is quite successful in reconstructing the gene sequences within BACs. In the case of plants such as barley, this level of sequence knowledge is sufficient to support critical end-point objectives such as map-based cloning and marker-assisted breeding.
2012: A physical, genetic and functional sequence assembly of the barley genome. The International Barley Genome Sequencing Consortium
Nature, 490(7422): pp.711-716, 2012. PDF. Supplemental. doi:10.1038/nature11543.

Barley (Hordeum vulgare L.) is among the world's earliest domesticated and most important crop plants. It is diploid with a large haploid genome of 5.1 gigabases (Gb). Here we present an integrated and ordered physical, genetic and functional sequence resource that describes the barley gene-space in a structured whole-genome context. We developed a physical map of 4.98Gb, with more than 3.90Gb anchored to a high-resolution genetic map. Projecting a deep whole-genome shotgun assembly, complementary DNA and deep RNA sequence data onto this framework supports 79,379 transcript clusters, including 26,159 'high-confidence' genes with homology support from other plant genomes. Abundant alternative splicing, premature termination codons and novel transcriptionally active regions suggest that post-transcriptional processing forms an important regulatory layer. Survey sequences from diverse accessions reveal a landscape of extensive single-nucleotide variation. Our data provide a platform for both genome-assisted research and enabling contemporary crop improvement.
2012: BRAT-BW: Efficient and accurate mapping of bisulfite-treated reads. E.Y. Harris, N. Ponts, K.G. Le Roch, S. Lonardi
Bioinformatics, 28(13): 1795-1796, 2012. PDF. Supplemental. doi:10.1093/bioinformatics/bts264.

We introduce BRAT-BW, a fast, accurate and memory-efficient tool that maps bisulfite-treated short reads (BS-seq) to a reference genome using the FM-index (Burrows-Wheeler transform). BRAT-BW is significantly more memory efficient and faster on longer reads than current state-of-the-art tools for BS-seq data, without compromising on accuracy. BRAT-BW is a part of a software suite for genome-wide single base-resolution methylation data analysis that supports single and paired-end reads and includes a tool for estimation of methylation level at each cytosine.
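As an illustration of the underlying bisulfite trick only (not BRAT-BW's FM-index implementation; the sequences below are toy data), collapse C to T in both read and reference before matching, then call methylation at reference C positions:

```python
# Core idea behind bisulfite-read mapping: unmethylated C's read as T's after
# bisulfite conversion, so read and forward reference strand are collapsed
# C->T before matching; methylation is then called where the reference has a C.

def collapse(seq):
    return seq.replace("C", "T")

def map_bisulfite_read(read, reference):
    """Return (position, methylation calls) for the first exact C->T-collapsed hit."""
    r, ref = collapse(read), collapse(reference)
    pos = ref.find(r)
    if pos < 0:
        return None
    calls = []
    for i, base in enumerate(read):
        if reference[pos + i] == "C":
            calls.append((pos + i, "methylated" if base == "C" else "unmethylated"))
    return pos, calls

reference = "AACGTTCGGA"
print(map_bisulfite_read("ACGTTTG", reference))
# C at reference position 2 still reads C -> methylated;
# C at reference position 6 reads T  -> unmethylated.
```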
2012: NOrMAL: Accurate Nucleosome Positioning using a Modified Gaussian Mixture Model. A. Polishko, N. Ponts, K. G. Le Roch, S. Lonardi
ISMB 2012 - Proceedings of Annual International Conference on Intelligent Systems for Molecular Biology, i242-i249, Long Beach, CA, 2012. Special Issue of Bioinformatics, 28(12): i242-i249, 2012. PDF. doi:10.1093/bioinformatics/bts206.

Nucleosomes are the basic elements of chromatin structure. They control the packaging of DNA and play a critical role in gene regulation by allowing physical access to transcription factors. The advent of second-generation sequencing has enabled landmark genome-wide studies of nucleosome positions for several model organisms. Current methods to determine nucleosome positioning first compute an occupancy coverage profile by mapping nucleosome-enriched sequenced reads to a reference genome; then, nucleosomes are placed according to the peaks of the coverage profile. These methods are quite accurate at placing isolated nucleosomes, but they do not properly handle more complex configurations. Also, they can only provide the positions of nucleosomes and their occupancy level, while it would be very beneficial to supply molecular biologists with additional information about nucleosomes, such as the probability of placement, the size of DNA fragments enriched for nucleosomes, and/or whether nucleosomes are well-positioned or "fuzzy" in the sequenced cell sample. Results: We address these issues by providing a novel method based on a parametric probabilistic model. An expectation maximization algorithm is used to infer the parameters of the mixture of distributions. We compare the performance of our method on two real datasets against Template Filtering, which is considered the current state-of-the-art. On synthetic data, we show that our method can resolve complex configurations of nucleosomes more accurately, and that it is more robust to user-defined parameters. On real data, we show that our method detects a significantly higher number of nucleosomes.
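For intuition, a minimal EM fit of a two-component 1D Gaussian mixture to synthetic read-center positions (NOrMAL's actual model is a modified mixture with additional parameters for fuzziness and fragment size, so this is only a sketch of the general technique):

```python
import math, random

def em_two_gaussians(xs, iters=50):
    """Minimal EM for a 2-component 1D Gaussian mixture."""
    mu = [min(xs), max(xs)]
    var = [max(1.0, (max(xs) - min(xs)) / 4) ** 2] * 2
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            w = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / math.sqrt(2 * math.pi * var[k])
                 for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: update mixture weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(1.0, sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk)
    return mu, var, pi

random.seed(0)
# two "nucleosomes" with centers ~200 and ~500 (synthetic read-center positions)
data = [random.gauss(200, 15) for _ in range(300)] + [random.gauss(500, 15) for _ in range(300)]
print(em_two_gaussians(data))
```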
2011: Chromatin-driven de novo discovery of DNA binding motifs in the human malaria parasite. E.Y. Harris, N. Ponts, K.G. Le Roch, S. Lonardi
BMC Genomics, 12:601, 2011. PDF. doi:10.1186/1471-2164-12-601.

Background: Despite extensive efforts to discover transcription factors and their binding sites in the human malaria parasite Plasmodium falciparum, only a few transcription factor-binding motifs have been experimentally validated to date. As a consequence, gene regulation in P. falciparum is still poorly understood. There is now evidence that the chromatin architecture plays an important role in transcriptional control in malaria. Results: We propose a methodology for discovering cis-regulatory elements that uses for the first time exclusively dynamic chromatin remodeling data. Our method employs nucleosome positioning data collected at seven time points during the erythrocytic cycle of P. falciparum to discover putative DNA binding motifs and their transcription factor binding sites along with their associated clusters of target genes. Our approach results in 129 putative binding motifs within the promoter region of known genes. About 75% of those are novel, the remaining being highly similar to experimentally validated binding motifs. About half of the binding motifs reported show statistically significant enrichment in functional gene sets and strong positional bias in the promoter region. Conclusion: Experimental results establish the principle that dynamic chromatin remodeling data can be used in lieu of gene expression data to discover binding motifs and their transcription factor binding sites. Our approach can be applied using only dynamic nucleosome positioning data, independent from any knowledge of gene function or expression.
2011: An improved consensus linkage map of barley based on flow-sorted chromosomes and single nucleotide polymorphism markers. M. Munoz-Amatriain, M. J. Moscou, P. R. Bhat, J. T. Svensson, J. Bartos, P. Suchankova, H. Simkova, T. R. Endo, R. D. Fenton, Y. Wu, S. Lonardi, A.M. Castillo, S. Chao, L. Cistue, A. Cuesta-Marcos, K. Forrest, M. J. Hayden, P. M. Hayes, R. D. Horsley, A. Kleinhofs, D. Moody, K. Sato, M. P. Valles, B. B. H. Wulff, G. J. Muehlbauer, J. Dolezel, T. J. Close
The Plant Genome, 4(3): 238-249, 2011. PDF. doi:10.3835/plantgenome2011.08.0023.

Recent advances in high-throughput genotyping have made it easier to combine information from different mapping populations into consensus genetic maps, which provide increased marker density and genome coverage compared to individual maps. Previously, a single nucleotide polymorphism (SNP)-based genotyping platform was developed and used to genotype 373 individuals in four barley (Hordeum vulgare L.) mapping populations. This led to a 2943 SNP consensus genetic map with 975 unique positions. In this work, we add data from six additional populations and more individuals from one of the original populations to develop an improved consensus map from 1133 individuals. A stringent and systematic analysis of each of the 10 populations was performed to achieve uniformity. This involved reexamination of the four populations included in the previous map. As a consequence, we present a robust consensus genetic map that contains 2994 SNP loci mapped to 1163 unique positions. The map spans 1137.3 cM with an average density of one marker bin per 0.99 cM. A novel application of the genotyping platform for gene detection allowed the assignment of 2930 genes to flow-sorted chromosomes or arms, confirmed the position of 2545 SNP-mapped loci, added chromosome or arm allocations to an additional 370 SNP loci, and delineated pericentromeric regions for chromosomes 2H to 7H. Marker order has been improved and map resolution has been increased by almost 20%. These improvements in precision enable better-optimized SNP selection for marker-assisted breeding and support association genetic analysis and map-based cloning. They will also improve the anchoring of DNA sequence scaffolds and the barley physical map to the genetic map.
2011: Nucleosome occupancy at transcription start sites in the human malaria parasite: a hard-wired evolution of virulence? N. Ponts, E. Y. Harris, S. Lonardi, K. G. Le Roch
Infection, Genetics and Evolution, 11(4): 716-724, 2011. PDF. doi:10.1016/j.meegid.2010.08.002.

Almost a decade after the publication of the complete sequence of the genome of the human malaria parasite Plasmodium falciparum, the mechanisms involved in gene regulation remain poorly understood. Like other eukaryotic organisms, P. falciparum's genomic DNA organizes into nucleosomes. Nucleosomes are the basic structural units of eukaryotic chromatin and their regulation is known to play a key role in the regulation of gene expression. Despite its importance, the relationship between nucleosome positioning and gene regulation in the malaria parasite has only been investigated recently. Using two independent and complementary techniques followed by next-generation high-throughput sequencing, our laboratory recently generated a dynamic atlas of nucleosome-bound and nucleosome-free regions (NFRs) at single-nucleotide resolution throughout the parasite erythrocytic cycle. We have found evidence that genome-wide changes in nucleosome occupancy play a critical role in controlling the rigorous parasite replication in infected red blood cells. However, the role of nucleosome positioning at key locations such as transcriptional start sites (TSS) had not been investigated. Here we show that a study of NFRs in experimentally determined TSSs and in silico-predicted promoters can provide deeper insights into how a transcriptionally permissive organization of chromatin can control the parasite's progression through its life cycle. We find that NFRs found at TSSs and core promoters are strongly associated with high levels of gene expression in asexual erythrocytic stages, whereas nucleosome-bound TSSs and promoters are associated with silent genes preferentially expressed in sexual stages. The implications in terms of regulatory evolution, adaptation of gene expression and their impact on the design of antimalarial strategies are discussed.
2011: String matching in hardware using the FM-index. E. B. Fernandez, W. Najjar, S. Lonardi
FCCM 2011 - IEEE International Symposium on Field-Programmable Custom Computing Machines, 218-225, Salt Lake City, Utah, 2011. Proceedings. PDF. doi:10.1109/FCCM.2011.55 .

String matching is a ubiquitous problem that arises in a wide range of applications in computing, e.g., packet routing, intrusion detection, web querying, and genome analysis. Due to its importance, dozens of algorithms and several data structures have been developed over the years. A recent breakthrough in this field is the FM-index, a data structure that synergistically combines the Burrows-Wheeler transform and the suffix array. In software, the FM-index allows searching (exact and approximate) in times comparable to the fastest known indices for large texts (suffix trees and suffix arrays), but has the additional advantage of being more space-efficient than those approaches. In this paper, we describe the first FPGA-based hardware implementation of the FM-index for exact pattern matching. We report experimental results on the problem of mapping short DNA sequences to a reference genome. We show that the throughput of the FM-index is significantly higher than the naive (brute force) approach. Like the Bowtie software tool, the hardware FM-index can abandon the matching early; it outperforms Bowtie by two orders of magnitude.
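A minimal software sketch of the FM-index backward-search loop for exact matching (the operation the hardware engines parallelize); the suffix array and rank counts are built naively here, whereas practical implementations use compressed rank structures:

```python
# Minimal FM-index exact matching via backward search (illustrative only).

def build_fm(text):
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ

def count_occurrences(pattern, sa, C, occ):
    lo, hi = 0, len(sa)               # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

sa, C, occ = build_fm("ACGTACGTAC")
print(count_occurrences("ACGT", sa, C, occ))   # -> 2
```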
2011: Accurate construction of consensus genetic maps via integer linear programming. Y. Wu, T. J. Close, S. Lonardi
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(2): 381-394, 2011. doi:10.1109/TCBB.2010.35.

We study the problem of merging genetic maps, when the individual genetic maps are given as directed acyclic graphs. The computational problem is to build a consensus map, which is a directed graph that includes and is consistent with all (or, the vast majority of) the markers in the input maps. However, when markers in the individual maps have ordering conflicts, the resulting consensus map will contain cycles. Here, we formulate the problem of resolving cycles in the context of a parsimonious paradigm that takes into account two types of errors that may be present in the input maps, namely local reshuffles and global displacements. The resulting combinatorial optimization problem is in turn expressed as an integer linear program. A fast approximation algorithm is proposed, and an additional speed-up heuristic is developed. Our algorithms were implemented in a software tool named MergeMap which is freely available for academic use. An extensive set of experiments show that MergeMap consistently outperforms JoinMap, which is the most popular tool currently available for this task, both in terms of accuracy and running time. MergeMap is available for download at http://www.cs.ucr.edu/~yonghui/mgmap.html.
2010: Exploration of short reads mapping in hardware. E. B. Fernandez, E. Harris, W. Najjar, S. Lonardi
FPL 2010 - International Conference on Field Programmable Logic and Applications, 360-363, Milano, Italy, 2010. Proceedings. PDF. doi:10.1109/FPL.2010.78.

The newest generation of sequencing instruments, such as Illumina/Solexa Genome Analyzer and ABI SOLiD, can generate hundreds of millions of short DNA “reads” from a single run. These reads must be matched against a reference genome to identify their original location. Due to sequencing errors or variations in the sequenced genome, the matching procedure must allow a variable but limited number of mismatches. This problem is a version of the classic approximate string matching where a long text is searched for the occurrence of a set of short patterns. Typical strategies to speed up the matching involve elaborate hashing schemes that exploit the inherent repetitions of the data. However, such large data structures are not well suited for FPGA implementations. In this paper we evaluate an FPGA implementation that uses a “naive” approach which checks every possible read-genome alignment. We compare the performance of the naive approach to popular software tools currently used to map short reads to a reference genome showing a speedup of up to 4X over the fastest software tool.
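For reference, a tiny sketch of the "naive" strategy the FPGA parallelizes (toy sequences and mismatch budget): test every read-genome offset and report alignments within the allowed number of mismatches:

```python
# Naive mapping: check every possible alignment and count mismatches.

def naive_map(read, genome, max_mismatches):
    hits = []
    for offset in range(len(genome) - len(read) + 1):
        mism = sum(r != g for r, g in zip(read, genome[offset:offset + len(read)]))
        if mism <= max_mismatches:
            hits.append((offset, mism))
    return hits

print(naive_map("ACGT", "TTACGTTACGAT", max_mismatches=1))
# -> [(2, 0), (7, 1)]: an exact hit at offset 2 and a 1-mismatch hit at offset 7
```

The FPGA design evaluates many such offsets concurrently, which is where the speedup over sequential software comes from.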
2010: Novel gene discovery in the human malaria parasite using nucleosome positioning data. N. Pokhriyal, N. Ponts, E. Y. Harris, K. G. Le Roch, S. Lonardi
CSB 2010 - Computational Systems Bioinformatics Conference, pp. 124-135, Stanford, CA, 2010. Proceedings. PDF.

Recent genome-wide studies on nucleosome positioning in model organisms have shown strong evidence that nucleosome landscapes in the proximity of protein-coding genes exhibit regular characteristic patterns. Here, we propose a computational framework to discover novel genes in the human malaria parasite genome using nucleosome positioning inferred from MAINE-seq data. We rely on a classifier trained on the nucleosome landscape profiles of experimentally verified genes, which is then used to discover new genes (without considering the primary DNA sequence). Cross-validation experiments show that our classifier is very accurate. About two thirds of the locations reported by the classifier match experimentally determined expressed sequence tags in GenBank, for which no gene has been annotated in the human malaria parasite.
2010: Nucleosome landscape and control of transcription in the human malaria parasite. N. Ponts, E. Y. Harris, J. Prudhomme, I. Wick, C. Eckhardt-Ludka, G. Hicks, G. Hardiman, S. Lonardi, and K. Le Roch
Genome Research, 20: 228-238, 2010. Supplemental doi:10.1101/gr.101063.109.

In eukaryotic cells, chromatin reorganizes within promoters of active genes to allow the transcription machinery and various transcription factors to access DNA. In this model, promoter-specific transcription factors bind DNA to initiate the production of mRNA in a tightly regulated manner. In the case of the human malaria parasite, Plasmodium falciparum, specific transcription factors are apparently underrepresented with regards to the size of the genome, and mechanisms underlying transcriptional regulation are controversial. Here, we investigate the modulation of DNA accessibility by chromatin remodeling during the parasite infection cycle. We have generated genome-wide maps of nucleosome occupancy across the parasite erythrocytic cycle using two complementary assays—the formaldehyde-assisted isolation of regulatory elements to extract protein-free DNA (FAIRE) and the MNase-mediated purification of mononucleosomes to extract histone-bound DNA (MAINE), both techniques being coupled to high-throughput sequencing. We show that chromatin architecture undergoes drastic upheavals throughout the parasite's cycle, contrasting with targeted chromatin reorganization usually observed in eukaryotes. Chromatin loosens after the invasion of the red blood cell and then repacks prior to the next cycle. Changes in nucleosome occupancy within promoter regions follow this genome-wide pattern, with a few exceptions such as the var genes involved in virulence and genes expressed at early stages of the cycle. We postulate that chromatin structure and nucleosome turnover control massive transcription during the erythrocytic cycle. Our results demonstrate that the processes driving gene expression in Plasmodium challenge the classical eukaryotic model of transcriptional regulation occurring mostly at the transcription initiation level.
2010: BRAT: Bisulfite-treated reads analysis tool. E. Y. Harris, N. Ponts, A. Levchuk, K. Le Roch, S. Lonardi
Bioinformatics, 26(4): 572-573, 2010. Supplemental doi:10.1093/bioinformatics/btp706.

We present a new, accurate and efficient tool for mapping short reads obtained from the Illumina Genome Analyzer following sodium bisulfite conversion. Our tool, BRAT, supports single and paired-end reads and handles input files containing reads and mates of different lengths. BRAT is faster, maps more unique paired-end reads and has higher accuracy than existing programs. The software package includes tools to end-trim low quality bases of the reads and to report nucleotide counts for mapped reads on the reference genome.
2010: Efficient algorithms for genome-wide tagSNP selection across populations via the linkage disequilibrium criterion. L. Liu, Y. Wu, S. Lonardi, T. Jiang
Journal of Computational Biology, 17(1): 21-37, 2010. doi:10.1089/cmb.2007.0228.

In this paper, we study the tagSNP selection problem on multiple populations using the pairwise r^2 linkage disequilibrium criterion. We propose a novel combinatorial optimization model for the tagSNP selection problem, called the minimum common tagSNP selection (MCTS) problem, and present efficient solutions for MCTS. Our approach consists of three main steps: (i) partitioning the SNP markers into small disjoint components, (ii) applying some data reduction rules to simplify the problem, and (iii) applying either a fast greedy algorithm or a Lagrangian relaxation algorithm to solve the remaining (general) MCTS. These algorithms also provide lower bounds on tagging (i.e., the minimum number of tagSNPs needed). The lower bounds allow us to evaluate how far our solution is from the optimum. To the best of our knowledge, this is the first time that tagging lower bounds have been discussed in the literature. We assess the performance of our algorithms on real HapMap data for genome-wide tagging. The experiments demonstrate that our algorithms run 3 to 4 orders of magnitude faster than existing single-population tagging programs such as FESTA and LD-Select, and than the multiple-population tagging method MultiPop-TagSelect. Our method also greatly reduces the required number of tagSNPs compared to LD-Select on a single population and MultiPop-TagSelect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal since they are very close to the corresponding lower bounds obtained by our method.
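A toy sketch of the greedy step only (made-up SNP names and tagging sets; the Lagrangian relaxation variant and the data reduction rules are not shown): repeatedly pick the SNP that tags the most still-untagged SNPs across all populations:

```python
# Greedy common tagSNP selection: `tags[pop][snp]` is the set of SNPs tagged
# by `snp` (r^2 above the threshold) in population `pop`. Toy data below.

def greedy_common_tags(tags):
    populations = list(tags)
    untagged = {p: set().union(*tags[p].values()) for p in populations}
    chosen = []
    while any(untagged.values()):
        def gain(snp):
            return sum(len(tags[p].get(snp, set()) & untagged[p]) for p in populations)
        best = max({s for p in populations for s in tags[p]}, key=gain)
        if gain(best) == 0:
            break
        chosen.append(best)
        for p in populations:
            untagged[p] -= tags[p].get(best, set())
    return chosen

tags = {
    "popA": {"s1": {"s1", "s2", "s3"}, "s4": {"s4"}},
    "popB": {"s1": {"s1", "s2"}, "s4": {"s3", "s4"}},
}
print(greedy_common_tags(tags))   # -> ['s1', 's4']
```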
2010: Graphlet kernels for prediction of functional residues in protein structures. V. Vacic, L. M. Iakoucheva, S. Lonardi, P. Radivojac
Journal of Computational Biology, 17(1): 55-72, 2010. Supplemental Data. doi:10.1089/cmb.2009.0029.

We introduce a novel graph-based kernel method for annotating functional residues in protein structures. A structure is first modeled as a protein contact graph, where nodes correspond to residues and edges connect spatially neighboring residues. Each vertex in the graph is then represented as a vector of counts of labeled non-isomorphic subgraphs (graphlets), centered on the vertex of interest. A similarity measure between two vertices is expressed as the inner product of their respective count vectors and is used in a supervised learning framework to classify protein residues. We evaluated our method on two function prediction problems: identification of catalytic residues in proteins, which is a well-studied problem suitable for benchmarking, and a much less explored problem of predicting phosphorylation sites in protein structures. The performance of the graphlet kernel approach was then compared against two alternative methods, a sequence-based predictor and our implementation of the FEATURE framework. On both tasks the graphlet kernel performed favorably; however, the margin of difference was considerably higher on the problem of phosphorylation site prediction. While there are indications that phosphorylation sites are preferentially positioned in intrinsically disordered regions, we provide evidence that for the sites that are located in structured regions, neither the surface accessibility alone nor the averaged measures calculated from the residue microenvironments utilized by FEATURE were sufficient to achieve high accuracy. The key benefit of the graphlet representation is its ability to capture neighborhood similarities in protein structures via enumerating the patterns of local connectivity in the corresponding labeled graphs.
2009: Development and implementation of high-throughput SNP genotyping in barley. T. J. Close, P. R. Bhat, S. Lonardi, Y. Wu, N. Rostoks, L. Ramsay, A. Druka, N. Stein, J. T. Svensson, S. Wanamaker, S. Bozdag, M. L. Roose, M. J. Moscou, S. Chao, R. Varshney, P. Szucs, K. Sato, P. M. Hayes, D. E. Matthews, A. Kleinhofs, G. J. Muehlbauer, J. DeYoung, D. F. Marshall, K. Madishetty, R. D. Fenton, P. Condamine, A. Graner, R. Waugh
BMC Genomics, 10:582, 2009. doi:10.1186/1471-2164-10-582.

High density genetic maps of plants have, nearly without exception, made use of marker datasets containing missing or questionable genotype calls derived from a variety of genic and non-genic or anonymous markers, and been presented as a single linear order of genetic loci for each linkage group. The consequences of missing or erroneous data include falsely separated markers, expansion of cM distances and incorrect marker order. These imperfections are amplified in consensus maps and problematic when fine resolution is critical, including comparative genome analyses and map-based cloning. Here we provide a new paradigm, a high-density consensus genetic map of barley based only on complete and error-free datasets and genic markers, represented accurately by graphs and approximately by a best-fit linear order, and supported by a readily available SNP genotyping resource. Approximately 22,000 SNPs were identified from barley ESTs and sequenced amplicons; 4,596 of them were tested for performance in three pilot phase Illumina GoldenGate assays. Data from three barley doubled haploid mapping populations supported the production of an initial consensus map. Over 200 germplasm selections, principally European and US breeding material, were used to estimate minor allele frequency (MAF) for each SNP. We selected 3,072 of these tested SNPs based on technical performance, map location, MAF and biological interest to fill two 1536-SNP “production” assays (BOPA1 and BOPA2), which were made available to the barley genetics community. Data were added using BOPA1 from a fourth mapping population to yield a consensus map containing 2,943 SNP loci in 975 marker bins covering a genetic distance of 1099 cM. The unprecedented density of genic markers and marker bins enabled a high resolution comparison of the genomes of barley and rice. Low recombination in pericentric regions is evident from bins containing many more than the average number of markers, meaning that a large number of genes are recombinationally locked into the genetic centromeric regions of several barley chromosomes. Examination of US breeding germplasm illustrated the usefulness of BOPA1 and BOPA2 in that they provide excellent marker density and sensitivity for detection of minor alleles in this genetically narrow material.
2009: Immune profile and mitotic index of metastatic melanoma lesions enhance TNM staging in predicting patient survival. D. Bogunovic, D. W. O'Neill, I. Belitskaya-Levy, V. Vacic, Y.-L. Yu, S. Adams, F. Darvishian, R. Berman, R. Shapiro, A. Pavlick, S. Lonardi, J. Zavadil, I. Osman and N. Bhardwaj
Proceedings of the National Academy of Sciences (PNAS), 106(48): 20429-20434, 2009. doi:10.1073/pnas.0905139106.

Although remission rates for metastatic melanoma are generally very poor, some patients can survive for prolonged periods following metastasis. We used gene expression profiling, mitotic index (MI), and quantification of tumor infiltrating leukocytes (TILs) and CD3+ cells in metastatic lesions to search for a molecular basis for this observation and to develop improved methods for predicting patient survival. We identified a group of 266 genes associated with postrecurrence survival. Genes positively associated with survival were predominantly immune response related (e.g., ICOS, CD3d, ZAP70, TRAT1, TARP, GZMK, LCK, CD2, CXCL13, CCL19, CCR7, VCAM1) while genes negatively associated with survival were cell proliferation related (e.g., PDE4D, CDK2, GREF1, NUSAP1, SPC24). Furthermore, any of the 4 parameters (prevalidated gene expression signature, TILs, CD3, and in particular MI) improved the ability of Tumor, Node, Metastasis (TNM) staging to predict postrecurrence survival; MI was the most significant contributor (HR = 2.13, P = 0.0008). An immune response gene expression signature and presence of TILs and CD3+ cells signify immune surveillance as a mechanism for prolonged survival in these patients and indicate improved patient subcategorization beyond current TNM staging.
2009: A compartmentalized approach to the assembly of physical maps. S. Bozdag, T.J. Close, S. Lonardi
BMC Bioinformatics, 10 217, 2009. doi:10.1186/1471-2105-10-217. BibTex entry.

Background: Physical maps have been historically one of the cornerstones of genome sequencing and map-based cloning strategies. They also support marker assisted breeding and EST mapping. The problem of building a high quality physical map is computationally challenging due to unavoidable noise in the input fingerprint data. Results: We propose a novel compartmentalized method for the assembly of high quality physical maps from fingerprinted clones. The knowledge of genetic markers enables us to group clones into clusters so that clones in the same cluster are more likely to overlap. For each cluster of clones, a local physical map is first constructed using FingerPrinted Contigs (FPC). Then, all the individual maps are carefully merged into the final physical map. Experimental results on the genomes of rice and barley demonstrate that the compartmentalized assembly produces significantly more accurate maps, and that it can detect and isolate clones that would induce "chimeric" contigs if used in the final assembly. Conclusion: The software is available for download at http://www.cs.ucr.edu/~sbozdag/assembler/
2008: Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. Y.Wu, P.Bhat, T.J.Close, S.Lonardi
PLoS Genetics, 4(10):e1000212, 2008. doi:10.1371/journal.pgen.1000212. BibTex entry. Faculty of 1000 review.

Genetic linkage maps are cornerstones of a wide spectrum of biotechnology applications, including map-assisted breeding, association genetics, and map-assisted gene cloning. During the past several years, the adoption of high-throughput genotyping technologies has been paralleled by a substantial increase in the density and diversity of genetic markers. New genetic mapping algorithms are needed in order to efficiently process these large datasets and accurately construct high-density genetic maps. In this paper, we introduce a novel algorithm to order markers on a genetic linkage map. Our method is based on a simple yet fundamental mathematical property that we prove under rather general assumptions. The validity of this property allows one to determine efficiently the correct order of markers by computing the minimum spanning tree of an associated graph. Our empirical studies obtained on genotyping data for three mapping populations of barley (Hordeum vulgare), as well as extensive simulations on synthetic data, show that our algorithm consistently outperforms the best available methods in the literature, particularly when the input data are noisy or incomplete. The software implementing our algorithm is available in the public domain as a web tool under the name MSTMAP.
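For intuition only, a toy Prim's-algorithm sketch on a made-up matrix of pairwise recombination fractions, illustrating why the minimum spanning tree of such a graph recovers the marker order when the property proved in the paper holds (this is not MSTMAP's actual pipeline):

```python
# Toy illustration of the MST idea: build a complete graph on markers weighted
# by pairwise recombination-fraction estimates and compute its MST with Prim's
# algorithm; under the paper's assumptions the MST is a path in marker order.

def prim_mst(weights):
    """`weights[i][j]`: symmetric distance matrix. Returns the list of MST edges."""
    n = len(weights)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        w, i, j = min(
            (weights[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, w))
        in_tree.add(j)
    return edges

# made-up recombination fractions between 4 markers whose true order is 0-1-2-3
weights = [
    [0.00, 0.05, 0.12, 0.22],
    [0.05, 0.00, 0.06, 0.15],
    [0.12, 0.06, 0.00, 0.08],
    [0.22, 0.15, 0.08, 0.00],
]
print(prim_mst(weights))   # -> [(0, 1, 0.05), (1, 2, 0.06), (2, 3, 0.08)], a path = the marker order
```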
2008: Computing the minimal tiling path from a physical map by integer linear programming. S.Bozdag, T.Close, S.Lonardi
WABI 2008 - Workshop on Algorithms in Bioinformatics, LNBI 5251, 148-161, Universitat Karlsruhe, Germany, 2008. doi:10.1007/978-3-540-87361-7_13. BibTex entry.

We study the problem of selecting the minimum tiling path (MTP) from a set of clones arranged in a physical map. We formulate the constraints of the MTP problem in a graph theoretical framework, and we derive an optimization problem that is solved via integer linear programming. Experimental results show that, when we compare our algorithm to the commonly used software FPC, the MTP produced by our method covers a higher portion of the genome, even using a smaller number of MTP clones. These results suggest that employing the MTP produced by our method instead of FPC's in a clone-by-clone sequencing project would reduce the sequencing cost by about 12%.
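As a rough illustration of casting tiling-path selection as an integer linear program, the sketch below solves a simplified set-cover-style version of the problem with the PuLP modeling library (an assumption made for the example; the paper's actual formulation is graph-theoretic and operates on fingerprint data). The clone names and interval coordinates are made up.

```python
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum, PULP_CBC_CMD

# Hypothetical clones given as (start, end) coordinates on a contig.
clones = {
    "c1": (0, 40), "c2": (30, 80), "c3": (60, 120),
    "c4": (0, 70), "c5": (100, 150), "c6": (90, 150),
}
positions = range(0, 150, 10)   # coarse grid of positions that must be covered

prob = LpProblem("minimum_tiling_path", LpMinimize)
use = {c: LpVariable(f"use_{c}", cat=LpBinary) for c in clones}

prob += lpSum(use.values())     # objective: number of selected clones
for p in positions:             # constraint: every position covered by >= 1 clone
    prob += lpSum(use[c] for c, (s, e) in clones.items() if s <= p < e) >= 1

prob.solve(PULP_CBC_CMD(msg=False))
print(sorted(c for c in clones if use[c].value() == 1))
```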
2008: On the accurate construction of consensus genetic maps. Y. Wu, T. J. Close, S. Lonardi
CSB 2008 - Computational Systems Bioinformatics Conference, 285-296, Stanford, CA, 2008. Proceedings. BibTex entry.

We study the problem of merging genetic maps, when the individual genetic maps are given as directed acyclic graphs. The problem is to build a consensus map, which includes and is consistent with all (or, the majority of) the markers in the individual maps. When markers in the input maps have ordering conflicts, the resulting consensus map will contain cycles. We formulate the problem of resolving the cycles in a combinatorial optimization framework, which in turn is expressed as an integer linear program. A faster approximation algorithm is proposed, and an additional speed-up heuristic is developed. According to an extensive set of experimental results, our tool is consistently better than JoinMap, both in terms of accuracy and running time.
2008: A linear-time algorithm for predicting functional annotations from protein-protein interaction networks. Y. Wu, S. Lonardi
Journal of Bioinformatics and Computational Biology, 6(6) 1049-1065, 2008. doi:10.1142/S0219720008003916. BibTex entry.

Recent proteome-wide screening efforts have made available genome-wide, high-throughput protein-protein interaction (PPI) maps for several model organisms. This has enabled the systematic analysis of PPI networks, which has become one of the primary challenges for the systems biology community. Here we address the problem of predicting the functional classes of proteins (i.e., GO annotations) based solely on the structure of the PPI network. We present a maximum likelihood formulation of the problem and the corresponding learning and inference algorithms. The time complexity of both algorithms is linear in the size of the PPI network, and experimental results show that their accuracy in functional prediction outperforms existing methods.
2008: Deconvoluting BAC-gene relationships using a physical map. Y. Wu, L. Liu, T. Close, S. Lonardi
Journal of Bioinformatics and Computational Biology, 6(3) 603-622, 2008. doi:10.1142/S0219720008003564. BibTex entry.

The deconvolution of the relationships between BAC clones and genes is a crucial step in the selective sequencing of the regions of interest in a genome. It usually requires combinatorial pooling of unique probes obtained from the genes (unigenes), and the screening of the BAC library using the pools in a hybridization experiment. Since several probes can hybridize to the same BAC, in order for the deconvolution to be achievable the pooling design has to be able to handle a large number of positives. As a consequence, smaller pools need to be designed, which in turn increases the number of hybridization experiments, possibly making the entire protocol infeasible. We propose a new algorithm that is capable of producing high-accuracy deconvolution even in the presence of a weak pooling design, i.e., when pools are rather large. The algorithm compensates for the decrease of information in the hybridization data by taking advantage of a physical map of the BAC clones. We show that the right combination of combinatorial pooling and our algorithm not only dramatically reduces the number of pools required, but also successfully deconvolutes the BAC-gene relationships with almost perfect accuracy.
2008: Length-based encoding of binary data in DNA. N. G. Portney, Y. Wu, L. K. Quezada, S. Lonardi, M. Ozkan
Langmuir, 24(5): 1613-1616, 2008. doi:10.1021/la703235y. BibTex entry.

We developed a system to encode digital information in DNA polymers based on the partial restriction digest (PRD). Our encoding method relies on the length of the fragments obtained by the PRD rather than the actual content of the nucleotide sequence, thus eliminating the need for expensive sequencing machinery. In this letter, we report on the encoding of 12 bits of data in a DNA fragment of 110 nucleotides and the process of recovering the data.
2008: Small RNAs and the regulation of cis-natural antisense transcripts in Arabidopsis. H. Jin, V. Vacic, T. Girke, S. Lonardi, J.-K. Zhu
BMC Molecular Biology, 9:6, 2008. doi:10.1186/1471-2199-9-6. BibTex entry.

In spite of large intergenic spaces in plant and animal genomes, 7% to 30% of genes in the genomes encode overlapping cis-natural antisense transcripts (cis-NATs). The widespread occurrence of cis-NATs suggests an evolutionary advantage for this type of genomic arrangement. Experimental evidence for the regulation of two cis-NAT gene pairs by natural antisense transcripts-generated small interfering RNAs (nat-siRNAs) via the RNA interference (RNAi) pathway has been reported in Arabidopsis. However, the extent of siRNA-mediated regulation of cis-NAT genes is still unclear in any genome. The hallmarks of RNAi regulation of NATs are 1) inverse regulation of two genes in a cis-NAT pair by environmental and developmental cues and 2) generation of siRNAs by cis-NAT genes. We examined Arabidopsis transcript profiling data from public microarray databases to identify cis-NAT pairs whose sense and antisense transcripts show opposite expression changes. A subset of the cis-NAT genes displayed negatively correlated expression profiles as well as inverse differential expression changes under at least one of the examined developmental stages or treatment conditions. By searching the Arabidopsis Small RNA Project (ASRP) and Massively Parallel Signature Sequencing (MPSS) small RNA databases as well as our stress-treated small RNA dataset, we found small RNAs that matched at least one gene in 646 pairs out of 1008 (64%) protein-coding cis-NAT pairs, which suggests that siRNAs may regulate the expression of many cis-NAT genes. 209 putative siRNAs have the potential to target more than one gene and half of these small RNAs could target multiple members of a gene family. Furthermore, the majority of the putative siRNAs within the overlapping regions tend to target only one transcript of a given NAT pair, which is consistent with our previous finding on salt- and bacteria-induced nat-siRNAs. In addition, we found that genes encoding plastid- or mitochondrion-targeted proteins are over-represented in the Arabidopsis cis-NATs and that 19% of sense and antisense partner genes of cis-NATs share at least one common Gene Ontology term, which suggests that they encode proteins with possible functional connection. The negatively correlated expression patterns of sense and antisense genes as well as the presence of siRNAs in many of the cis-NATs suggest that small RNA-mediated regulation of cis-NATs via the RNAi pathway is an important gene regulatory mechanism for at least a subgroup of cis-NATs in Arabidopsis.
2008: A probabilistic method for small RNA flowgram matching. V. Vacic, H. Jin, J.-K. Zhu, S. Lonardi
PSB 2008 - Pacific Symposium on Biocomputing, 75-86, Big Island of Hawaii, 2008. BibTex entry.

The 454 pyrosequencing technology is gaining popularity as an alternative to traditional Sanger sequencing. While each method has comparative advantages over the other, certain properties of the 454 method make it particularly well suited for small RNA discovery. We here describe some of the details of the 454 sequencing technique, with an emphasis on the nature of the intrinsic sequencing errors and methods for mitigating their effect. We propose a probabilistic framework for small RNA discovery, based on matching 454 flowgrams against the target genome. We formulate flowgram matching as an analog of profile matching, and adapt several profile matching techniques for the task of matching flowgrams. As a result, we are able to recover some of the hits missed by existing methods and assign probability-based scores to them.
2007: A compartmentalized approach to the assembly of physical maps. S. Bozdag, T.J. Close, S. Lonardi
BIBE 2007 - IEEE International Symposium on Bioinformatics & Bioengineering, 218-225, Boston, MA, 2007. doi:10.1109/BIBE.2007.4375568. BibTex entry.

Conference version of the BMC Bioinformatics 2009 paper above.
2007: A linear-time algorithm for predicting functional annotations from protein-protein interaction networks. Y. Wu, S. Lonardi
BIOKDD 2007 - 7th International Workshop on Data Mining in Bioinformatics, 35-41, San Jose, 2007. BibTex entry.

Conference version of the JBCB paper above.
2007: A decomposition approach to the identification of network building modules. Q. Yang, S. Lonardi
BIOKDD 2007 - 7th International Workshop on Data Mining in Bioinformatics, 50-59, San Jose, 2007. BibTex entry.

The increasing availability of biological networks (protein-protein interaction graphs, metabolic and transcriptional networks, etc.) is offering new opportunities to analyze their topological properties and possibly gain new insights into their design principles. Here we concentrate on the problem of de novo identification of the building modules of networks, which we refer to as network modules. We propose a novel graph decomposition algorithm based on the notion of edge betweenness that discovers network modules without assuming any a priori knowledge. We claim that the knowledge of the distribution of network modules carries more information than the distribution of subgraphs that is commonly used in the literature. To demonstrate the effectiveness of the statistics based on network modules, we show that our method is capable of more accurately clustering networks known to have distinct topologies, and that the number of informative components in our feature vector is significantly higher. We also show that our approach is very robust to structural perturbations (i.e., edge rewiring) to the network. When we apply our algorithm to protein-protein interaction (PPI) networks, our decomposition method identifies highly connected network modules that occur significantly more frequently than those found in the corresponding random networks. Detailed inspection of the functions of the over-represented network modules in the S. cerevisiae PPI network shows that the proteins involved in the modules either belong to the same cellular complex or share biological functions with high similarity. A comparative analysis of PPI networks against AS-level internet graphs shows that in AS-level networks highly connected network modules are less frequent but more tightly connected with each other.
2007: Composition profiler: A tool for discovery and visualization of amino acid composition differences. V. Vacic, V. N. Uversky, A. K. Dunker, S. Lonardi
BMC Bioinformatics, 8:211, 2007. doi:10.1186/1471-2105-8-211. PubMed abstract. BibTex entry.

Composition Profiler is a web-based tool for semi-automatic discovery of enrichment or depletion of amino acids, either individually or grouped by their physico-chemical or structural properties. The program takes two samples of amino acids as input: a query sample and a reference sample. The latter provides a suitable background amino acid distribution, and should be chosen according to the nature of the query sample, for example, a standard protein database (e.g. SwissProt, PDB), a representative sample of proteins from the organism under study, or a group of proteins with a contrasting functional annotation. The results of the analysis of amino acid composition differences are summarized in textual and graphical form. As an exploratory data mining tool, our software can be used to guide feature selection for protein function or structure predictors. For classes of proteins with significant differences in frequencies of amino acids having particular physico-chemical (e.g. hydrophobicity or charge) or structural (e.g. alpha helix propensity) properties, Composition Profiler can be used as a rough, light-weight visual classifier.
2007: Efficient and accurate construction of genetic linkage maps from noisy and missing genotyping data. Y. Wu, P. Bhat, T. J Close, S. Lonardi
WABI 2007 - Workshop on Algorithms in Bioinformatics, LNBI 4645, pp.395-406, Philadelphia PA, 2007. BibTex entry.

Conference version of the PLoS Genetics 2008 article above.
2007: Efficient algorithms for genome-wide tagSNP selection across populations via the linkage disequilibrium criterion. L. Liu, Y. Wu, S. Lonardi, T. Jiang
CSB 2007 - Computational Systems Bioinformatics Conference, 67-78, San Diego CA, 2007. PubMed abstract. BibTex entry.

Conference version of the JCB 2009 article above.
2007: Deconvoluting the BAC-gene relationships using a physical map. Y. Wu, L. Liu, T. Close, S. Lonardi
CSB 2007 - Computational Systems Bioinformatics Conference, 203-214, San Diego CA, 2007. PubMed abstract. BibTex entry.

Conference version of the JBCB paper above.
2007: A parallel algorithm for clustering protein-protein interaction networks. Q. Yang, S. Lonardi
International Journal of Data Mining and Bioinformatics, 1(3) 241-247, 2007. BibTex entry.

The increasing availability of interaction graphs requires new resource-efficient tools capable of extracting valuable biological knowledge from these networks. In this paper we report on a novel parallel implementation of Girvan and Newman's clustering algorithm that is capable of running on clusters of computers. Our parallel implementation achieves almost linear speed-up up to 32 processors and allows us to run this computationally intensive algorithm on large protein-protein interaction networks. Preliminary experiments show that the algorithm has very high accuracy in identifying functionally related protein modules.
2006: Finding biclusters by random projections. S. Lonardi, W. Szpankowski, Q. Yang
Theoretical Computer Science, 368(3): 217-230, 2006. doi:10.1016/j.tcs.2006.09.023. BibTex entry.

Given a matrix X composed of symbols, a bicluster is a submatrix of X obtained by removing some of the rows and some of the columns of X in such a way that each row of what is left reads the same string. In this paper, we are concerned with the problem of finding the bicluster with the largest area in a large matrix X. The problem is first proved to be NP-complete. We present a fast and efficient randomized algorithm that discovers the largest bicluster by random projections. A detailed probabilistic analysis of the algorithm and an asymptotic study of the statistical significance of the solutions are given. We report results of extensive simulations on synthetic data.
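The core projection step can be illustrated in a few lines: repeatedly pick a random subset of columns, bucket the rows by the string they spell on those columns, and keep the largest bucket as a candidate bicluster. The sketch below shows only this step, without the paper's probabilistic analysis or refinement phases; the toy matrix and parameter values are made up for the example.

```python
import random
from collections import defaultdict

def largest_bicluster(X, n_trials=200, n_cols=3, seed=0):
    # Repeatedly project onto a random subset of columns, bucket rows by the
    # string they read on those columns, and remember the largest bucket found.
    rng = random.Random(seed)
    n_all_cols = len(X[0])
    best = (0, [], [])                              # (area, rows, cols)
    for _ in range(n_trials):
        cols = sorted(rng.sample(range(n_all_cols), n_cols))
        buckets = defaultdict(list)
        for r, row in enumerate(X):
            buckets[tuple(row[c] for c in cols)].append(r)
        rows = max(buckets.values(), key=len)
        if len(rows) * n_cols > best[0]:
            best = (len(rows) * n_cols, rows, cols)
    return best

# Toy symbol matrix with a planted 3x3 bicluster (rows 0-2, columns 1-3 spell "ABC").
X = ["zABCx", "yABCz", "xABCy", "pqrst", "vwxyz"]
print(largest_bicluster(X))     # expected: (9, [0, 1, 2], [1, 2, 3])
```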
2006: Evolution versus "intelligent design": Comparing the topology of protein-protein interaction networks to the Internet. Q. Yang, G. Siganos, M. Faloutsos, S. Lonardi
CSB 2006 - Computational Systems Bioinformatics Conference, 299-310, Stanford, CA, 2006. Proceedings. BibTex entry.

In this study, we attempt to better understand the topological structure of PPI networks by comparing them against man-made communication networks, and more specifically, the Internet. Our comparative study is based on a comprehensive set of graph metrics. Our results exhibit an interesting dichotomy. On the one hand, both networks share several macroscopic properties such as scale-free and small-world properties. On the other hand, the two networks exhibit significant topological differences, such as the cliquishness of the highest degree nodes.
2006: Oligospawn: a web-based tool for the design of overgo probes from large unigene databases. J. Zheng, J. T. Svensson, K. Madishetty, T. J. Close, T. Jiang, S. Lonardi
BMC Bioinformatics, 7(7), 2006. doi:10.1186/1471-2105-7-7. PubMed abstract. BibTex entry.

OLIGOSPAWN is a suite of software tools that offers two complementary services, namely (1) the selection of "unique" oligos, each of which appears in one unigene but does not occur (exactly or approximately) in any other, and (2) the selection of "popular" oligos, each of which occurs (exactly or approximately) in as many unigenes as possible.
2005: Storage and transmission of microarray images. Y.Luo, S.Lonardi
Drug Discovery Today, 10(23/24):1689-1695, 2005. doi:10.1016/S1359-6446(05)03647-0. PubMed abstract. BibTex entry.

With the recent explosion of interest in microarray technology, massive amounts of microarray images are currently being produced. The storage and transmission of these types of data are becoming increasingly challenging. This article reviews the latest technologies that allow for the compression and storage of microarray images in dedicated database systems.
2005: Discovery of repetitive patterns in DNA with accurate boundaries. J. Zheng, S. Lonardi
BIBE 2005 - IEEE International Symposium on BioInformatics and BioEngineering, 105-112, Minneapolis, Minnesota, 2005. Proceedings. doi:10.1109/BIBE.2005.23. BibTex entry.

The accurate identification of repeats remains a challenging open problem in bioinformatics. Most existing methods of repeat identification either depend on annotated repeat databases or restrict repeats to pairs of similar sequences that are maximal in length. The fundamental flaw in most of the available methods is the lack of a definition that correctly balances the importance of the length and the frequency. In this paper, we propose a new definition of repeats that satisfies both criteria. We give a novel characterization of the building blocks of repeats, called elementary repeats, which leads to a natural definition of repeat boundaries. We design efficient algorithms and test them on synthetic and real biological data. Experimental results show that our method is highly accurate and capable of finding repetitive patterns missed by other tools.
2005: A parallel algorithm for clustering protein-protein interaction networks. Q. Yang, S. Lonardi
CSB 2005 - IEEE Computational Systems Bioinformatics Conference, 174-177, Stanford, CA, 2005. Workshop and Poster Abstract. doi:10.1109/CSBW.2005.13. BibTex entry.

Conference version of the IJDMB paper above.
2005: Assignment of orthologous genes via genome rearrangement. X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, T. Jiang
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4):302-315, 2005. doi:10.1109/TCBB.2005.48. PubMed abstract. BibTex entry.

The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes of the same families. In this paper, we present a new approach to ortholog assignment that takes into account both sequence similarity and evolutionary events at a genome level, where orthologous genes are assumed to correspond to each other in the most parsimonious evolutionary scenario under genome rearrangement.
2005: Computing the assignment of orthologous genes via genome rearrangement. X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, T. Jiang
APBC 2005 - Asia-Pacific Bioinformatics Conference, 363-378, Singapore, 2005. Proceedings. BibTex entry.

Conference version of the IEEE/ACM TCBB paper above.
2004: Gridding and compression of microarray images. S. Lonardi, Y. Luo
CSB 2004 - IEEE Computational Systems Bioinformatics Conference, 122-130, Stanford, CA, 2004. Proceedings. doi:10.1109/CSB.2004.1332424. PubMed abstract. BibTex entry.

With the recent explosion of interest in microarray technology, massive amounts of microarray images are currently being produced. The storage and the transmission of this type of data are becoming increasingly challenging. Here we propose lossless and lossy compression algorithms for microarray images originally digitized at 16 bpp (bits per pixel) that achieve an average of 9.5-11.5 bpp (lossless) and 4.6-6.7 bpp (lossy, with a PSNR of 63 dB). The lossy compression is applied only to the background of the image, thereby preserving the regions of interest. The methods are based on a completely automatic gridding procedure of the image.
2004: Finding biclusters by random projections. S. Lonardi, W. Szpankowski, Q. Yang
CPM 2004 - Symposium on Combinatorial Pattern Matching, pp.102-116, Istanbul, Turkey, 2004. Proceedings, LNCS 3109. doi:10.1007/b98377. BibTex entry.

Conference version of the TCS 2006 paper above.
2004: Efficient selection of unique and popular oligos for large EST databases. J.Zheng, T.Close, T.Jiang, S.Lonardi
Bioinformatics, 20(13):2101-2112, 2004. PubMed abstract. BibTex entry.

In this paper, we study two complementary problems concerning the selection of short oligos, e.g. 20-50 bases, from a large database of tens of thousands of ESTs: (i) selection of oligos each of which appears (exactly) in one unigene but does not appear (exactly or approximately) in any other unigene and (ii) selection of oligos that appear (exactly or approximately) in many unigenes. The first problem is called the unique oligo problem and has applications in PCR primer and microarray probe designs, and library screening for gene-rich clones. The second is called the popular oligo problem and is also useful in screening genomic libraries. We present an efficient algorithm to identify all unique oligos in the unigenes and an efficient heuristic algorithm to enumerate the most popular oligos.
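A toy version of the "unique oligo" half of the problem can be written as a simple k-mer index: report k-mers that occur in exactly one unigene. The sketch below is only an illustration under that simplification; it handles exact occurrences only, whereas the paper also excludes oligos with approximate occurrences elsewhere and scales to tens of thousands of ESTs. The miniature sequences and the value of k are made up.

```python
from collections import defaultdict

def unique_oligos(unigenes, k=8):
    # Index every k-mer and keep those that occur in exactly one unigene.
    occurs_in = defaultdict(set)
    for name, seq in unigenes.items():
        for i in range(len(seq) - k + 1):
            occurs_in[seq[i:i + k]].add(name)
    return {kmer: next(iter(names))
            for kmer, names in occurs_in.items() if len(names) == 1}

unigenes = {                     # made-up miniature "unigene" sequences
    "u1": "ACGTACGTTTGACCGTA",
    "u2": "ACGTACGTTTGAGGGTA",
    "u3": "TTTTCCCCGGGGAAAA",
}
uniq = unique_oligos(unigenes, k=8)
print(len(uniq), sorted(uniq)[:3])
```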
2004: Computational biology. S. Kurtz, S. Lonardi
Handbook on Data Structures and Applications, Dinesh Mehta and Sartaj Sahni (eds), pp.58.1-58.17, CRC Press, ISBN:1584884355, 2004.

The exploration of many computational problems arising in contemporary molecular biology has now grown to become a new field of Computer Science. A coarse selection would include sequence homology and alignment, physical and genetic mapping, protein folding and structure prediction, gene expression analysis, evolutionary tree construction, gene finding, assembly for shotgun sequencing, gene rearrangements and pattern discovery, among others. In this chapter we focus our attention on the applications of suffix trees to computational biology. In fact, suffix trees and their variants are among the most used (and useful) data structures in computational biology.
2004: Verbumculus and the discovery of unusual words. A. Apostolico, F. Gong, S. Lonardi
Journal of Computer Science and Technology (special issue on Bioinformatics), 19(1):22-41, 2004. BibTex entry.

Verbumculus is a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences. The inner core of Verbumculus rests on subtly interwoven properties of statistics, pattern matching and combinatorics on words that enable one to limit drastically and a priori the set of over- or under-represented candidate words of all lengths in a given sequence, thereby rendering it more feasible both to detect and visualize such words in a fast and practically useful way. This paper is devoted to the description of the facility and to the report of experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements in the upstream regions of a set of yeast genes.
2003: Efficient selection of unique and popular oligos for large EST databases. J. Zheng, T. Close, T. Jiang, S. Lonardi
CPM 2003 - Symposium on Combinatorial Pattern Matching, 384-401, Morelia, Mexico, 2003. Proceedings, LNCS 2676. BibTex entry.

Conference version of the 2004 Bioinformatics paper above.
2003: Monotony of surprise and large-scale quest for unusual words. A. Apostolico, M. E. Bock, S. Lonardi
Journal of Computational Biology, 10(2/3):283-311, 2003. doi:10.1089/10665270360688020. PubMed Abstract. BibTex entry.

In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous studies, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
2002: Analysis of secondary structure elements of proteins using indexing techniques. C. Guerra, S. Lonardi, G. Zanotti
3DPVT 2002 - Symposium on 3D Data Processing Visualization and Transmission, 812-823, Padova, Italy, 2002. Proceedings. doi:10.1109/TDPVT.2002.1024166. BibTex entry.

In this paper we present a method for protein structure comparison that is based on indexing. Unlike most methods using indexing, ours does not use invariants of the C_alpha atoms of the proteins, rather it relies on geometric properties of the secondary structures. Given a set of protein structures, we compute the angles and distances of all the triplets of linear segments associated to the secondary structures of the proteins and use them to build hash tables. The hash tables can be used for fast retrieval of hypotheses of matches of a query protein against the database.
2002: Monotony of surprise and large-scale quest for unusual words. A. Apostolico, M. E. Bock, S. Lonardi
RECOMB 2002 - ACM Annual Conference on Research in Computational Molecular Biology, 22-31, Washington, DC, 2002. Proceedings. doi:10.1145/565196.565200. BibTex entry.

Conference version of the 2003 JCB paper above.
2001: Global detectors of unusual words: Design, implementation, and applications to pattern discovery in biosequences. S. Lonardi
Ph.D. Thesis, Department of Computer Sciences, Purdue University, August 2001. BibTex entry.

In this thesis, we show efficient and practical algorithms for the problem of detecting words that are, by some measure, over- or under-represented in the context of larger sequences. The design is based on subtly interwoven properties of statistics, pattern matching and combinatorics on words. These properties enable us to limit drastically and a priori the set of over- or under-represented candidate words of all lengths in a given sequence, thereby rendering it more feasible both to detect and to visualize such words in a fast and practically useful way. We also demonstrate that such anomaly detectors can be used successfully to discover exact patterns in biological sequences, by reporting results of a software tool, called Verbumculus, on simulated data and test cases of practical interest.
2000: Efficient detection of unusual words. A. Apostolico, M. E. Bock, S. Lonardi, X. Xu
Journal of Computational Biology, 7(1/2):71-94, 2000. doi:10.1089/10665270050081397. PubMed Abstract. BibTex entry.

Words that are, by some measure, over- or underrepresented in the context of larger sequences have been variously implicated in biological functions and mechanisms. In most approaches to such anomaly detections, the words (up to a certain length) are enumerated more or less exhaustively and are individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. Here we take the global approach of annotating the suffix tree of a sequence with some such values and scores, having in mind to use it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to undergo a more accurate scrutiny. We consider in depth the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. Our main result consists of showing that, within this model, full tree annotations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of significance.
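For intuition about the scores being annotated, the brute-force sketch below enumerates substrings, compares observed counts with the counts expected under an i.i.d. model, and ranks them by a naive z-score. This is not the paper's method: the paper computes such statistics for all words at once on an annotated suffix tree, in linear time and space and with exact variance formulas. The example sequence and score cutoffs are invented.

```python
from collections import Counter
from math import sqrt

def unusual_words(seq, max_len=6, top=5):
    # Brute-force scoring: observed count vs. the count expected under an
    # i.i.d. model, ranked by a crude normal-approximation z-score.
    n = len(seq)
    p = {c: seq.count(c) / n for c in set(seq)}
    counts = Counter(seq[i:i + m] for m in range(2, max_len + 1)
                     for i in range(n - m + 1))
    scored = []
    for w, obs in counts.items():
        pw = 1.0
        for c in w:
            pw *= p[c]
        exp = (n - len(w) + 1) * pw
        scored.append(((obs - exp) / sqrt(exp * (1 - pw)), w, obs, round(exp, 2)))
    scored.sort(reverse=True)
    return scored[:top]

print(unusual_words("ACGT" * 5 + "TTAGGGTTAGGGTTAGGG"))
```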
2000: A whole-genome assembly of drosophila. E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. J. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H.-H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, J. C. Venter
Science, 287(5461):2196-2204, 2000. PDF doi:10.1126/science.287.5461.2196. PubMed Abstract. BibTex entry.

We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.
1999: Linear global detectors of redundant and rare substrings. A. Apostolico, M. E. Bock, S. Lonardi
DCC 1999 - IEEE Data Compression Conference, 168-177, Snowbird, Utah, 1999. Proceedings. doi:10.1109/DCC.1999.755666. BibTex entry.

The identification of strings that are, by some measure, redundant or rare in the context of larger sequences is an implicit goal of any data compression method. In the straightforward approach to searching for unusual substrings, the words (up to a certain length) are enumerated more or less exhaustively and individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. As is well known, clever methods are available to compute and organize the counts of occurrences of all substrings of a given string. The corresponding tables take up the tree-like structure of a special kind of digital search index or trie. We show here that under several accepted measures of deviation from expected frequency, the candidate over- or under-represented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the O(n^2) possible substrings.
1996: Shape recovery and volume calculation from biplane angiography in the stereotactic radiosurgery treatment of Arteriovenous malformation. R. Foroni, J. Hoch, G. Giri, A. Pasoli, M. Gerosa, A. Pasqualin, A. Nicolato, E. Piovan, P. Zampieri, E. Bortolazzi, S. Lonardi
International Journal of Radiation Oncology, Biology and Physics, 35(3):565-577, 1996. PubMed Abstract. BibTex entry.

Three-dimensional (3D) volume reconstruction is easily feasible with axial, coronal, or sagittal computed tomography (CT) and nuclear magnetic resonance (NMR) scans. On the other hand, radiosurgical treatment of arteriovenous malformations (AVM) is exclusively based on two orthogonal stereotactic projections, obtained with angiographic procedures. Most commonly, AVM volumes have been calculated by assimilating the nidus volume to a prolate ellipsoid. We present an algorithm dedicated to 3D structure reconstruction starting from two orthogonal stereotactic projections. This has been achieved using a heuristic approach, which has been widely adopted in the artificial intelligence domain.

Data Mining and Visualization

2015: Using the Minimum Description Length to Discover the Intrinsic Cardinality and Dimensionality of Time Series. B. Hu, T. Rakthanmanon, Y. Hao, S. Evans, S. Lonardi, E. Keogh
Data Mining and Knowledge Discovery, 29(2): 358-399, 2015. PDF. doi:10.1007/s10618-014-0345-2. BibTex entry.

Many algorithms for data mining or indexing time series data do not operate directly on the raw data, but instead they use alternative representations that include transforms, quantization, approximation, and multi-resolution abstractions. Choosing the best representation and abstraction level for a given task/dataset is arguably the most critical step in time series data mining. In this work, we investigate the problem of discovering the natural intrinsic representation model, dimensionality and alphabet cardinality of a time series. The ability to automatically discover these intrinsic features has implications beyond selecting the best parameters for particular algorithms, as characterizing data in such a manner is useful in its own right and an important sub-routine in algorithms for classification, clustering and outlier discovery. We will frame the discovery of these intrinsic features in the Minimal Description Length framework. Extensive empirical tests show that our method is simpler, more general and more accurate than previous methods, and has the important advantage of being essentially parameter-free.
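A much-simplified sketch of the MDL scoring idea: cast the series into a small integer space, and for each candidate cardinality charge the bits needed for the reduced-cardinality hypothesis plus the bits needed for the residual; the cardinality minimizing the total is reported. The coding scheme below (uniform quantization, entropy-coded residuals) and the toy four-level signal are assumptions made purely for illustration; the paper's coding scheme and search are more refined.

```python
import math
from collections import Counter

def entropy_bits(values):
    # Total bits to encode a list of symbols at its empirical entropy.
    counts, n = Counter(values), len(values)
    return -sum(c * math.log2(c / n) for c in counts.values())

def mdl_cardinality(series, max_card=16, space=64):
    lo, hi = min(series), max(series)
    ints = [round((x - lo) / (hi - lo) * (space - 1)) for x in series]  # cast to 0..63
    best = None
    for c in range(2, max_card + 1):
        levels = [round(v / (space - 1) * (c - 1)) for v in ints]       # hypothesis
        recon = [round(l / (c - 1) * (space - 1)) for l in levels]
        residual = [v - r for v, r in zip(ints, recon)]
        dl = len(ints) * math.log2(c) + entropy_bits(residual)          # DL(H) + DL(D|H)
        if best is None or dl < best[0]:
            best = (dl, c)
    return best[1], round(best[0], 1)

# A toy signal with four equally spaced levels; the minimum lands at cardinality 4.
series = [0, 0, 2, 2, 4, 4, 6, 6] * 8
print(mdl_cardinality(series))   # (4, 128.0)
```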
2012: MDL-based time series clustering. T. Rakthanmanon, E. Keogh, S. Lonardi, S. Evans
Knowledge and Information Systems, 33(2):371-399, 2012. PDF. doi:10.1007/s10115-012-0508-7.

Time series data are pervasive across all human endeavors, and clustering is arguably the most fundamental data mining application. Given this, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series that have been carefully extracted from their original context, for example, gene expression profiles, individual heartbeats, or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases shown to work only on the most contrived synthetic datasets by carefully adjusting a large set of parameters. In this work, we make two fundamental contributions that allow for the first time, the meaningful clustering of subsequences from a time series stream. First, we show that the problem definition for time series clustering from streams currently used is inherently flawed, and a new definition is necessary. Second, we show that the minimum description length framework offers an efficient, effective, and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, speech recognition, zoology, gesture recognition, and industrial process analyses.
2011: Time series epenthesis: clustering time series streams requires ignoring some data. T. Rakthanmanon, E. Keogh, S. Lonardi, S. Evans
ICDM 2011 - IEEE International Conference on Data Mining, to appear, Vancouver, Canada, 2011. Proceedings. PDF.

Given the pervasiveness of time series data in all human endeavors, and the ubiquity of clustering as a data mining application, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series, e.g., gene expression profiles, individual heartbeats or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases shown to work only on the most contrived datasets by carefully adjusting a large set of parameters. In this work, we make two fundamental contributions. First, we show that the problem definition for time series clustering from streams currently used is inherently flawed, and a new definition is necessary. Second, we show that the Minimum Description Length (MDL) framework offers an efficient, effective and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, zoology and industrial process analyses.
2011: Discovering the intrinsic cardinality and dimensionality of time series using MDL. B. Hu, T. Rakthanmanon, Y. Hao, S. Evans, S. Lonardi, E. Keogh
ICDM 2011 - IEEE International Conference on Data Mining, to appear, Vancouver, Canada, 2011. Proceedings. PDF (accepted short version). PDF (long version).

Most algorithms for mining or indexing time series data do not operate directly on the original data, but instead they consider alternative representations that include transforms, quantization, approximation, and multi-resolution abstractions. Choosing the best representation and abstraction level for a given task/dataset is arguably the most critical step in time series data mining. In this paper, we investigate techniques to discover the natural intrinsic representation model, dimensionality and alphabet cardinality of a time series. The ability to discover these intrinsic features has implications beyond selecting the best parameters for particular algorithms, as characterizing data in such a manner is useful in its own right and an important sub-routine in algorithms for classification, clustering and outlier discovery. We will frame the discovery of these intrinsic features in the Minimal Description Length (MDL) framework. Extensive empirical tests show that our method is simpler, more general and significantly more accurate than previous methods, and has the important advantage of being essentially parameter-free.
2007: Experiencing SAX: a novel symbolic representation of time series. J. Li, E. Keogh, S. Lonardi
Data Mining and Knowledge Discovery, 15(2):107-144, 2007. doi:10.1007/s10618-007-0064-z.

Many high-level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to avail themselves of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
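The two steps of the SAX representation named in the title (Piecewise Aggregate Approximation, then discretization against equiprobable breakpoints of the standard normal) are easy to sketch. The fragment below is a minimal illustration with assumed default parameters, not a reference implementation.

```python
import statistics
import string

def sax(series, n_segments=8, alphabet_size=4):
    # 1) z-normalize, 2) Piecewise Aggregate Approximation (PAA),
    # 3) map each segment mean to a symbol via equiprobable Gaussian breakpoints.
    n = len(series)
    mu = statistics.fmean(series)
    sd = statistics.pstdev(series) or 1.0
    z = [(x - mu) / sd for x in series]

    paa = [statistics.fmean(z[i * n // n_segments:(i + 1) * n // n_segments])
           for i in range(n_segments)]

    nd = statistics.NormalDist()
    breakpoints = [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

    symbols = string.ascii_lowercase[:alphabet_size]
    return "".join(symbols[sum(v > b for b in breakpoints)] for v in paa)

print(sax([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3], n_segments=4))
```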
2007: Efficient discovery of unusual patterns in time series. S. Lonardi, J. Lin, E. Keogh, B. Chiu
New Generation Computing (Special issue on Knowledge Discovery from Data Streams), 25(1):61-93, 2007. doi:10.1007/s00354-006-0004-2. BibTex entry.

All previous attempts at finding surprising patterns in time series use a very limited notion of surprise, and/or do not scale to massive datasets. To overcome these limitations we propose a novel technique that defines a pattern as surprising if the frequency of its occurrence differs substantially from that expected by chance, given some previously seen data. This notion has the advantage of not requiring the user to explicitly define what is a surprising pattern, which may be hard, or perhaps impossible, to elicit from a domain expert. Instead, the user gives the algorithm a collection of previously observed "normal" data. Our algorithm uses a suffix tree to efficiently encode the frequency of all observed patterns and allows a Markov model to predict the expected frequency of previously unobserved patterns. Once the suffix tree has been constructed, a measure of surprise for all the patterns in a new database can be determined in time and space linear in the size of the database. We demonstrate the utility of our approach with an extensive experimental evaluation.
2007: Compression-based data mining of sequential data. E. Keogh, S. Lonardi, C. A. Ratanamahatana, L. Wei, S.-H. Lee, J. Handley
Data Mining and Knowledge Discovery, 14(1):99-129, 2007. doi:10.1007/s10618-006-0049-3. BibTex entry.

In this work, we show that recent results in bioinformatics, learning and computational theory hold great promise for a parameter-light data-mining paradigm. The results are strongly connected to Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will show that this approach is competitive or superior to many of the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/XML/video datasets. As further evidence of the advantages of our method, we will demonstrate its effectiveness in solving a real-world classification problem in recommending printing services and products.
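The dissimilarity measure at the heart of this line of work really is only a few lines wrapped around an off-the-shelf compressor: compare the compressed size of the concatenation against the sizes of the parts compressed separately. The sketch below uses zlib as that compressor purely as an assumption for illustration, since the approach is compressor-agnostic, and the example sequences are made up.

```python
import zlib

def cdm(x: bytes, y: bytes) -> float:
    # Compressed size of the concatenation divided by the sum of the sizes of
    # the parts compressed separately: ~1 for unrelated data, smaller when one
    # sequence helps compress the other.
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return cxy / (cx + cy)

seq_a = b"ACGT" * 200
seq_b = b"ACGT" * 190 + b"TTTTGGGGCCCC" * 3 + b"ACGT" * 10
seq_c = bytes(range(256)) * 4
print(round(cdm(seq_a, seq_b), 3), round(cdm(seq_a, seq_c), 3))  # related vs. unrelated
```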
2006: Intelligent icons: integrating lite-weight data mining and visualization into GUI operating systems. E. Keogh, L. Wei, X. Xi, S. Lonardi, J. Shieh, S. Sirowy
ICDM 2006 - IEEE International Conference on Data Mining, 912-916, Hong Kong, 2006. doi:10.1109/ICDM.2006.90. BibTex entry.

The vast majority of visualization tools introduced so far are specialized pieces of software that are explicitly run on a particular dataset at a particular time for a particular purpose. In this work we introduce a novel framework for allowing visualization to take place in the background of normal day-to-day operation of any GUI-based operating system such as MS Windows, OS X or Linux. By allowing visualization to occur in the background of quotidian computer activity (i.e., finding, moving, deleting, copying files, etc.) we allow a greater possibility of unexpected and serendipitous discoveries. Our system works by replacing the standard file icons with automatically created icons that reflect the contents of the files in a principled way. We call such icons INTELLIGENT ICONS. While there is little utility in examining an individual icon, examining groups of them allows us to take advantage of the small multiples paradigm advocated by Tufte. We can further enhance the utility of Intelligent Icons by arranging them on the screen in a way that reflects their similarity/differences, rather than the traditional "view by date", "view by size", etc.
2006: A bit level representation for time series data mining with shape based similarity. A. Bagnall, C. A. Ratanamahatana, E. Keogh, S. Lonardi, G. Janacek
Data Mining and Knowledge Discovery, 13(1):11-40, 2006. doi:10.1007/s10618-005-0028-0. BibTex entry.

Clipping is the process of transforming a real-valued series into a sequence of bits representing whether each data point is above or below the average. In this paper, we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time-dependent data sets. We demonstrate how time series stored as bits can be very efficiently compressed and manipulated and that, under some assumptions, the discriminatory power with clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this means we can form a tight lower bounding metric for Euclidean and Dynamic Time Warping distance and hence efficiently query by content. Clipped data can be used in conjunction with a host of algorithms and statistical tests that naturally follow from the binary nature of the data. A series of experiments illustrate how clipped series can be used in increasingly complex ways to achieve better results than other popular representations. The usefulness of the proposed representation is demonstrated by the fact that the results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transformation at the same compression ratio for both clustering and query by content.
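A minimal version of the clipped representation, assuming the simplest variant described above (one bit per point, thresholded at the series mean, packed into bytes); the series values are invented for the example.

```python
import statistics

def clip(series):
    # One bit per point: 1 if above the series mean, 0 otherwise, packed into bytes.
    mu = statistics.fmean(series)
    bits = [1 if x > mu else 0 for x in series]
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        packed.append(byte)
    return bits, bytes(packed)

bits, packed = clip([3.1, 2.8, 4.0, 5.2, 6.1, 5.9, 4.2, 2.5, 1.9, 2.2])
print(bits, packed.hex())
```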
2005: Integrating lite-weight but ubiquitous data mining into GUI operating systems. L. Wei, E. Keogh, X. Xi, S. Lonardi
Journal of Universal Computer Science, special Issue on Visual Data Mining, J.S.Aguilar-Ruiz (ed.), 11(11):1820-1834, 2005. BibTex entry.

In this work, we introduce a novel framework which allows visualization to take place in the background of normal day to day operations of any GUI based operating system such as MS Windows, OS X or Linux. Our system works by replacing the standard file icons with automatically generated icons that reflect the contents of the files in a principled way. We call such icons Intelligent Icons. While there is little utility in examining an individual icon, examining groups of them provides a greater possibility of unexpected and serendipitous discoveries. The utility of Intelligent Icons can be further enhanced by arranging them on the screen in a way that reflects their similarity/differences. We demonstrate the utility of our approach on data as diverse as DNA, text files, electrocardiograms, and Space Shuttle telemetry. In addition we show that our system is unique in also supporting fast and intuitive similarity search.
2005: Dot plots for time series analysis. D. Yankov, E. Keogh, S. Lonardi, A. W. Fu
ICTAI 2005 - IEEE International Conference on Tools with Artificial Intelligence, 159-168, Hong Kong, 2005. Proceedings. doi:10.1109/ICTAI.2005.60. BibTex entry.

In this paper we demonstrate how dot plots can be used for the analysis and mining of real-valued time series. We design a tool that creates highly descriptive dot plots which allow one to easily detect similarities, anomalies, reverse similarities, and periodicities, as well as changes in the frequencies of repetitions. As the underlying algorithm scales well with the input size, we also show the feasibility of the plots for on-line data monitoring.
2005: Visualizing and discovering non-trivial patterns in large time series databases. J. Lin, E. Keogh, S. Lonardi
Information Visualization, 4(2):61-82, 2005. doi:10.1057/palgrave.ivs.9500089. BibTex entry.

In this paper we present VizTree, a time series pattern discovery and visualization system based on augmenting suffix trees. VizTree visually summarizes both the global and local structures of time series data at the same time. In addition, it provides novel interactive solutions to many pattern discovery problems, including the discovery of frequently occurring patterns (motif discovery), surprising patterns (anomaly detection), and query by content. VizTree works by transforming the time series into a symbolic representation, and encoding the data in a modified suffix tree in which the frequency and other properties of patterns are mapped onto colors and other visual properties.
2005: Assumption-free anomaly detection in time series. L. Wei, N. Kumar, V. Lolla, E. Keogh, S. Lonardi, C. A. Ratanamahatana
SSDBM 2005 - International Statistical and Scientific Database Management Conference, 237-240, Santa Barbara, CA, 2005. Proceedings. BibTex entry.

In this demonstration, we will show an online anomaly detection system that does not need to be customized for individual domains, yet performs with exceptionally high precision/recall. The system is based on the recently introduced idea of time series bitmaps. To demonstrate the universality of our system, we will allow testing on independently annotated datasets from domains as diverse as ECGs, Space Shuttle telemetry monitoring, video surveillance, and respiratory data. In addition, we invite attendees to test our system with any dataset available on the web.
2005: A practical tool for visualizing and data mining medical time series. L. Wei, N. Kumar, V. N. Lolla, E. Keogh, S. Lonardi, C. A. Ratanamahatana, H. Van Herle
CBMS 2005 - IEEE International Symposium on Computer-Based Medical Systems, 341-346, Dublin, Ireland, 2005. Proceedings. doi:10.1109/CBMS.2005.17. BibTex entry.

The increasing interest in time series data mining has had surprisingly little impact on real world medical applications. Practitioners who work with time series on a daily basis rarely take advantage of the wealth of tools that the data mining community has made available. In this work, we attempt to address this problem by introducing a parameter-light tool that allows users to efficiently navigate through large collections of time series. Our approach extracts features from a time series of arbitrary length and uses information about the relative frequency of these features to color a bitmap in a principled way. By visualizing the similarities and differences within a collection of bitmaps, a user can quickly discover clusters, anomalies, and other regularities within the data collection. We demonstrate the utility of our approach with a set of comprehensive experiments on real datasets from a variety of medical domains.
2005: A novel bit level time series representation with implications for similarity search and clustering. C. A. Ratanamahatana, E. Keogh, A. J. Bagnall, S. Lonardi
PAKDD 2005 - Pacific-Asia Conference on Knowledge Discovery and Data Mining, 771-777, Hanoi, Vietnam, 2005. Proceedings, LCNS 3518. doi:10.1007/11430919_90. BibTex entry.

In this work, we introduce a new technique based on a bit-level approximation of the data. The representation has several important advantages over existing techniques. One unique advantage is that it allows raw data to be directly compared to the reduced representation, while still guaranteeing lower bounds to Euclidean distance. This fact can be exploited to produce faster exact algorithms for similarity search. In addition, we demonstrate that our new representation allows time series clustering to scale to much larger datasets.
2005: Time-series bitmaps: a practical visualization tool for working with large time series databases. N. Kumar, N. Lolla, E. Keogh, S. Lonardi, C. A. Ratanamahatana, L. Wei
SDM 2005 - SIAM Conference in Data Mining, 531-535, Newport Beach, CA, 2005. Proceedings. BibTex entry.

The increasing interest in time series data mining in the last decade has resulted in the introduction of a variety of similarity measures, representations and algorithms. Surprisingly, this massive research effort has had little impact on real-world applications. Real-world practitioners who work with time series on a daily basis rarely take advantage of the wealth of tools that the data mining community has made available. In this work we attempt to address this problem by introducing a simple parameter-light tool that allows users to efficiently navigate through large collections of time series. Our system has the unique advantage that it can be embedded directly into any standard graphical user interface, such as Microsoft Windows, thus making deployment easier. Our approach extracts features from a time series of arbitrary length, and uses information about the relative frequency of its features to color a bitmap in a principled way. By visualizing the similarities and differences within a collection of bitmaps, a user can quickly discover clusters, anomalies, and other regularities within their data collection.
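The bitmap construction can be approximated in a few lines: count the frequencies of short symbolic subwords of a (e.g., SAX-discretized) series, lay the counts out on a small grid, and normalize so each cell can be rendered as a color intensity. The sketch below uses overlapping 2-grams on a row-major 4x4 grid as a simplification; the actual bitmaps use a quad-tree-style layout of longer subwords, and the example word is made up.

```python
from collections import Counter

def bitmap(sax_word, alphabet="abcd"):
    # Count overlapping 2-grams of a symbolic word, lay the 16 counts out on a
    # 4x4 grid, and normalize to [0, 1] so each cell can be drawn as a color.
    counts = Counter(sax_word[i:i + 2] for i in range(len(sax_word) - 1))
    top = max(counts.values())
    return [[counts.get(a + b, 0) / top for b in alphabet] for a in alphabet]

for row in bitmap("abccbaabccbaabddccba"):
    print(["%.2f" % v for v in row])
```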
2004: Towards parameter-free data mining. E. Keogh, S. Lonardi, C. A. Ratanamahatana
KDD 2004 - ACM Conference on Knowledge Discovery and Data Mining, 206-215, Seattle, WA, 2004. Proceedings. doi:10.1145/1014052.1014077. BibTex entry.

Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.
2004: Visually mining and monitoring massive time series. J. Lin, E. Keogh, S. Lonardi, J. P. Lankford, D. M. Nystrom
KDD 2004 - ACM Conference on Knowledge Discovery and Data Mining, 460-469, Seattle, WA, 2004. Proceedings. doi:10.1145/1014052.1014104. BibTex entry.
This work also appears as a VLDB 2004 demo paper, under the title VizTree: a tool for visually mining and monitoring massive time series.
VLDB 2004 - International Conference on Very Large Data Bases, 1269-1272, Toronto, Canada, 2004. Proceedings.

Moments before the launch of every space vehicle, engineering discipline specialists must make a critical go/no-go decision. The cost of a false positive, allowing a launch in spite of a fault, or a false negative, stopping a potentially successful launch, can be measured in the tens of millions of dollars, not including the cost in morale and other more intangible detriments. The Aerospace Corporation is responsible for providing engineering assessments critical to the go/no-go decision for every Department of Defense space vehicle. These assessments are made by constantly monitoring streaming telemetry data in the hours before launch. We will introduce VizTree, a novel time-series visualization tool to aid the Aerospace analysts who must make these engineering assessments. VizTree was developed at the University of California, Riverside and is unique in that the same tool is used for mining archival data and monitoring incoming live telemetry. The use of a single tool for both aspects of the task allows a natural and intuitive transfer of mined knowledge to the monitoring task. Our visualization approach works by transforming the time series into a symbolic representation, and encoding the data in a modified suffix tree in which the frequency and other properties of patterns are mapped onto colors and other visual properties.
2003: Probabilistic discovery of time series motifs. B. Chiu, E. Keogh, S. Lonardi
KDD 2003 - ACM Conference on Knowledge Discovery and Data Mining, 493-498, Washington, DC, 2003. Proceedings. doi:10.1145/956750.956808. BibTex entry.

Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or "don't care" symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
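The random-projection step at the heart of this approach can be sketched as follows: given symbolic words for the subsequences (obtained, e.g., by SAX-style discretization), repeatedly hash the words on a random subset of positions and count how often each pair of words collides; pairs with the highest collision counts are candidate motifs, to be verified against the raw series. The words, the parameter values, and the choice of returning the top three pairs below are assumptions made for illustration only.

```python
import random
from collections import defaultdict
from itertools import combinations

def motif_candidates(words, n_iters=50, mask_size=4, seed=1):
    # Repeatedly bucket the words by the symbols at a random subset of positions
    # and count, for every pair of words, how often the two collide.
    rng = random.Random(seed)
    word_len = len(words[0])
    collisions = defaultdict(int)
    for _ in range(n_iters):
        positions = rng.sample(range(word_len), mask_size)
        buckets = defaultdict(list)
        for i, w in enumerate(words):
            buckets[tuple(w[p] for p in positions)].append(i)
        for bucket in buckets.values():
            for i, j in combinations(bucket, 2):
                collisions[(i, j)] += 1
    return sorted(collisions.items(), key=lambda kv: -kv[1])[:3]

# Hypothetical symbolic words for eight subsequences; words 1 and 5 are identical,
# so that pair should dominate the collision counts.
words = ["abccba", "bcddca", "cddcba", "aabbcc", "ddccbb", "bcddca", "abcdda", "dcbaab"]
print(motif_candidates(words))
```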
2003: A symbolic representation of time series, with implications for streaming algorithms. J. Lin, E. Keogh, S. Lonardi, B. Chiu
DMKD 2003 - ACM Workshop on Research Issues in Data Mining and Knowledge Discovery, 2-11, San Diego, CA, 2003. Proceedings. doi:10.1145/882082.882086.

In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic representation that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead.
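A minimal, hedged sketch of the conversion step: z-normalize, reduce with Piecewise Aggregate Approximation (PAA), and discretize with breakpoints chosen so symbols are approximately equiprobable under a normal model. The three-symbol breakpoints are the standard Gaussian cut points; segment count and alphabet are illustrative choices.

```python
import statistics
from bisect import bisect_left

BREAKPOINTS_3 = [-0.43, 0.43]          # standard cut points for a 3-symbol alphabet
ALPHABET = "abc"

def sax(series, segments=8, breakpoints=BREAKPOINTS_3):
    """Convert a numeric series to a short symbolic word."""
    mu, sigma = statistics.mean(series), statistics.pstdev(series) or 1.0
    z = [(v - mu) / sigma for v in series]             # z-normalize
    n = len(z)
    word = []
    for s in range(segments):                          # PAA: mean of each segment
        chunk = z[s * n // segments:(s + 1) * n // segments]
        word.append(ALPHABET[bisect_left(breakpoints, statistics.mean(chunk))])
    return "".join(word)

print(sax([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, -1, -2, -3, -2, -1]))
```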
2002: Mining motifs in massive time series databases. P. Patel, E. Keogh, J. Lin, S. Lonardi
ICDM 2002 - IEEE Conference on Data Mining, 370-377, Maebashi City, Japan, 2002. Proceedings. doi:10.1109/ICDM.2002.1183925. BibTex entry.

The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs" because of their close analogy to their discrete counterparts in computational biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification. In this work we carefully motivate, then introduce, a non-trivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real world datasets.
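To make the definition concrete, here is a hedged, brute-force sketch of 1-motif discovery: the subsequence with the most non-trivial (non-overlapping) matches within a distance threshold R wins. The distance measure and exclusion rule below are simplified stand-ins; this quadratic baseline is what efficient motif-discovery algorithms accelerate.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_motif(series, w, R):
    """Return (position, match count) of the subsequence of length w with most non-trivial matches."""
    best_pos, best_count = None, -1
    for i in range(len(series) - w + 1):
        count = sum(
            1
            for j in range(len(series) - w + 1)
            if abs(i - j) >= w                                   # non-trivial: no overlap
            and euclid(series[i:i + w], series[j:j + w]) <= R
        )
        if count > best_count:
            best_pos, best_count = i, count
    return best_pos, best_count

data = [0, 1, 2, 1, 0, 5, 6, 5, 0, 1, 2, 1, 0, 9, 0, 1, 2, 1, 0]
print(one_motif(data, w=5, R=0.5))   # expected: the repeated [0, 1, 2, 1, 0] shape
```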
2002: Finding motifs in time series. J. Lin, E. Keogh, S. Lonardi, P. Patel
TDM 2002 - 2nd Workshop on Temporal Data Mining at the 8th ACM Conference on Knowledge Discovery and Data Mining, 53-68, Edmonton, Alberta, Canada, 2002. BibTex entry.

Workshop version of the ICDM'02 paper.
2002: Finding surprising patterns in a time series database in linear time and space. E. Keogh, S. Lonardi, B. Chiu
KDD 2002 - ACM Conference on Knowledge Discovery and Data Mining, 550-556, Edmonton, Alberta, Canada, 2002. Proceedings. doi:10.1145/775047.775128. BibTex entry.

The problem of finding a specified pattern in a time series database (i.e., query by content) has received much attention and is now a relatively mature field. In contrast, the important problem of enumerating all surprising or interesting patterns has received far less attention. This problem requires a meaningful definition of "surprise" and an efficient search technique. All previous attempts at finding surprising patterns in time series use a very limited notion of surprise, and/or do not scale to massive datasets. To overcome these limitations we introduce a novel technique that defines a pattern as surprising if the frequency of its occurrence differs substantially from that expected by chance, given some previously seen data.
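A hedged illustration of that definition on a symbolic (discretized) series: fit a simple order-1 Markov model on reference ("previously seen") data, compute the expected count of each short word in the new data, and rank words by how much their observed count exceeds expectation. The model order and scoring here are simplified stand-ins for the paper's method.

```python
from collections import Counter

def markov1(reference):
    """Symbol and transition probabilities estimated from a reference string."""
    sym = Counter(reference)
    trans = Counter(zip(reference, reference[1:]))
    total = len(reference)
    p1 = {s: c / total for s, c in sym.items()}
    p2 = {(a, b): c / sym[a] for (a, b), c in trans.items()}
    return p1, p2

def expected_count(word, n, p1, p2):
    """Expected occurrences of `word` in a string of length n under the model."""
    p = p1.get(word[0], 0.0)
    for a, b in zip(word, word[1:]):
        p *= p2.get((a, b), 0.0)
    return p * (n - len(word) + 1)

def surprise(new, reference, w=2):
    p1, p2 = markov1(reference)
    observed = Counter(new[i:i + w] for i in range(len(new) - w + 1))
    return sorted(
        ((obs - expected_count(word, len(new), p1, p2), word)
         for word, obs in observed.items()),
        reverse=True,
    )

ref = "abababababababababab"
new = "abababccababab"
print(surprise(new, ref)[:3])   # "cc", "ca" and "bc" should rank as most surprising
```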

Data Compression and Information Theory

2007: Error resilient LZ'77 data compression: algorithms, analysis, and experiments. S. Lonardi, W. Szpankowski, M. D. Ward
IEEE Transactions on Information Theory, 53(5): 1799-1813, 2007. doi:10.1109/TIT.2007.894689. BibTex entry.

We propose a joint source-channel coding algorithm capable of correcting some errors in the popular Lempel-Ziv'77 scheme without practically losing any compression power. This can be achieved because the LZ'77 encoder does not completely eliminate the redundancy present in the input sequence. One source of redundancy can be observed when a LZ'77 phrase has multiple matches. In this case, LZ'77 can issue a pointer to any of those matches, and a particular choice carries some additional bits of information. We call a scheme with embedded redundant information the LZS'77 algorithm. We analyze the number of longest matches in such a scheme and prove that it follows the logarithmic series distribution with mean 1/h (plus some fluctuations), where h is the source entropy. Thus, the number of redundant bits is well concentrated around its mean, a highly desirable property for error correction. These analytic results are proved by a combination of combinatorial, probabilistic and analytic methods (e.g., Mellin transform, depoissonization, combinatorics on words). In fact, we analyze LZS'77 by studying the multiplicity matching parameter in a suffix tree, which in turn is analyzed via comparison to its independent version, also known as a trie. Finally, we present an algorithm in which a channel coder (e.g., Reed-Solomon coder) succinctly uses the inherent additional redundancy left by the LZS'77 encoder to detect and correct a limited number of errors. We call such a scheme the LZRS'77 algorithm. LZRS'77 is perfectly backward-compatible with LZ'77, that is, a file compressed with our error-resistant LZRS'77 can still be decompressed by a generic LZ'77 decoder.
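A toy illustration of the source of redundancy described above: when the current phrase's longest previous match occurs at several positions, the encoder may point to any of them, so the pointer choice can carry roughly log2(#matches) extra bits. This is only a demonstration of the counting, not the paper's encoder.

```python
import math

def longest_match_positions(text, pos):
    """All starting positions of the longest previous match for text[pos:]."""
    best_len, starts = 0, []
    for start in range(pos):
        length = 0
        while pos + length < len(text) and text[start + length] == text[pos + length]:
            length += 1
        if length > best_len:
            best_len, starts = length, [start]
        elif length == best_len and length > 0:
            starts.append(start)
    return best_len, starts

text = "abcabcabcabXabc"
length, starts = longest_match_positions(text, 12)   # the phrase starting at the final "abc"
print(length, starts, math.floor(math.log2(len(starts))), "extra bit(s) available")
```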
2006: Efficient information compression in sensor networks. S. Lin, V. Kalogeraki, D. Gunopulos, S. Lonardi
International Journal of Sensor Networks, Special Issue on Recent Developments in Wireless Sensor Networks, 1(3/4): 229-240, 2006.

In the emerging area of wireless sensor networks, one of the most typical challenges is to retrieve historical information from the sensor nodes. Due to the resource limitations of sensor nodes (processing, memory, bandwidth, and energy), the collected information of sensor nodes has to be compressed quickly and precisely for transmission. In this paper, we propose a new technique -- the ALVQ (Adaptive Learning Vector Quantization) algorithm -- to compress this historical information. The ALVQ algorithm constructs a codebook to capture the prominent features of the data; using these features, all the other data can be piece-wise encoded for compression. In addition, we extend our ALVQ algorithm to compress multi-dimensional information by transforming the multidimensional data into a one-dimensional data array. Finally, we consider the problem of transmitting data in a sensor network while maximizing the precision. We show how we apply our algorithm so that a set of sensors can dynamically share a wireless communication channel.
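A hedged sketch of the codebook idea: learn a few representative windows of readings (a generic online LVQ/k-means-style update is used below as a stand-in for ALVQ) and encode each window as a codeword index plus an optional residual. All constants and data are illustrative.

```python
import random

def train_codebook(windows, k=4, lr=0.1, epochs=20, seed=0):
    """Learn k representative windows with simple winner-take-all updates."""
    rng = random.Random(seed)
    book = [list(w) for w in rng.sample(windows, k)]
    for _ in range(epochs):
        for w in windows:
            j = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(book[i], w)))
            book[j] = [c + lr * (x - c) for c, x in zip(book[j], w)]  # move winner toward sample
    return book

def encode(window, book):
    """Encode a window as (nearest codeword index, residual)."""
    j = min(range(len(book)), key=lambda i: sum((a - b) ** 2 for a, b in zip(book[i], window)))
    residual = [round(x - c, 2) for x, c in zip(window, book[j])]
    return j, residual

readings = [[20 + random.random() for _ in range(4)] for _ in range(30)] + \
           [[35 + random.random() for _ in range(4)] for _ in range(30)]
book = train_codebook(readings, k=2)
print(encode([20.3, 20.1, 20.7, 20.4], book))
```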
2006: A compression-boosting transform for two-dimensional data. Q. Yang, S. Lonardi, A. Melkman
AAIM 2006 - International Conference on Algorithmic Aspects in Information and Management, 126-137, Hong Kong, China, 2006. Proceedings. LNCS 4041. doi:10.1007/11775096_13. BibTex entry.

We introduce a novel invertible transform for two-dimensional data with the objective of reordering the matrix so as to improve its (lossless) compressibility at later stages. The transform requires solving a computationally hard problem, for which a randomized algorithm is used. The inverse transform is fast and can be implemented in linear time in the size of the matrix. Preliminary experimental results show that the reordering improves the compressibility of digital images.
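To give the flavor of such a transform, here is a hedged sketch that greedily chains the rows of a binary matrix by Hamming distance so similar rows end up adjacent (a simple stand-in for the paper's randomized algorithm), keeps the permutation so the transform stays invertible, and reports the compressed size before and after reordering with an off-the-shelf compressor.

```python
import zlib

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def reorder_rows(matrix):
    """Greedy nearest-neighbour chaining of rows; the returned permutation makes the transform invertible."""
    remaining = list(range(len(matrix)))
    order = [remaining.pop(0)]
    while remaining:
        last = matrix[order[-1]]
        nxt = min(remaining, key=lambda r: hamming(matrix[r], last))
        remaining.remove(nxt)
        order.append(nxt)
    return order

def compressed_size(matrix):
    return len(zlib.compress(bytes(v for row in matrix for v in row), 9))

rows = [[1, 1, 0, 0] * 8, [0, 0, 1, 1] * 8, [1, 1, 0, 0] * 8, [0, 0, 1, 1] * 8] * 4
order = reorder_rows(rows)
print(compressed_size(rows), compressed_size([rows[i] for i in order]))
```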
2006: Online information compression in sensor networks. S. Lin, V. Kalogeraki, D. Gunopulos, S. Lonardi
ICC 2006 - IEEE International Conference on Communications, Istanbul, Turkey, 2006. BibTex entry.

Conference version of the IJSN paper above.
2006: Error-resilient LZW data compression. Y. Wu, S. Lonardi, W. Szpankowski
DCC 2006 - IEEE Data Compression Conference, 193-202, Snowbird, Utah, 2006. Proceedings. doi:10.1109/DCC.2006.33. BibTex entry.

Lossless data compression systems are typically regarded as very brittle to transmission errors. This limits their applicability to domains like noisy tetherless channels or file systems that can possibly get corrupted. Here we show how a popular lossless data compression scheme used in file formats GIF, PDF, and TIFF, among others, can be made error-resilient in such a way that the compression performance is minimally affected. The new scheme is designed to be backward-compatible, that is, a file compressed with our error-resilient algorithm can be still decompressed by the original decoder. In this preliminary report, we present our scheme, collect some experimental data supporting our claims, and provide some theoretical justifications.
2005: Storage and transmission of microarray images. Y. Luo, S. Lonardi
Drug Discovery Today, 10(23/24):1689-1695, 2005. doi:10.1016/S1359-6446(05)03647-0. BibTex entry.

With the recent explosion of interest in microarray technology, massive amounts of microarray images are currently being produced. The storage and transmission of these types of data are becoming increasingly challenging. This article reviews the latest technologies that allow for the compression and storage of microarray images in dedicated database systems.
2004: Gridding and compression of microarray images. S. Lonardi, Y. Luo
CSB 2004 - IEEE Computational Systems Bioinformatics Conference, 122-130, Stanford, CA, 2004. Proceedings. doi:10.1109/CSB.2004.1332424. BibTex entry.

With the recent explosion of interest in microarray technology, massive amounts of microarray images are currently being produced. The storage and the transmission of this type of data are becoming increasingly challenging. Here we propose lossless and lossy compression algorithms for microarray images originally digitized at 16 bpp (bits per pixel) that achieve an average of 9.5-11.5 bpp (lossless) and 4.6-6.7 bpp (lossy, with a PSNR of 63 dB). The lossy compression is applied only to the background of the image, thereby preserving the regions of interest. The methods are based on a completely automatic gridding procedure of the image.
2004: Augmenting LZ-77 with authentication and integrity assurance capabilities. M. J. Atallah, S. Lonardi
Concurrency and Computation: Practice and Experience, 16(11): 1063-1076, 2004. doi:10.1002/cpe.804. BibTex entry.

In this paper, we propose a simple variation on the classic LZ-77 algorithm that allows one to hide, within the compressed document, enough information to warrant its authenticity and integrity. The design is based on the unpredictability of a certain class of pseudo-random number generators, in such a way that the hidden data cannot be retrieved in a reasonable amount of time by an attacker (unless the secret bit-string key is known). Since it can still be decompressed by the original LZ-77 algorithm, the embedding is completely transparent and backward-compatible, making it possible to deploy it without disrupting service. Experiments show that the degradation in compression due to the embedding is almost negligible.
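A toy, hedged demonstration of the embedding channel underlying this design: whenever the current phrase has several equally long previous matches, a keyed pseudo-random rule decides which one a "marked" encoder picks, so a verifier holding the same key can re-derive and check the hidden bits. This is purely illustrative; the actual scheme couples such choices with a keyed authentication tag over the data.

```python
import hashlib
import hmac

def keyed_choice(key: bytes, phrase_index: int, candidates: list, bit: int) -> int:
    """Pick one of the equally good match positions so the choice encodes `bit`."""
    digest = hmac.new(key, str(phrase_index).encode(), hashlib.sha256).digest()
    start = digest[0] % len(candidates)                  # keyed, attacker-unpredictable offset
    return candidates[(start + bit) % len(candidates)]   # shifting the choice encodes the bit

def recover_bit(key: bytes, phrase_index: int, candidates: list, chosen: int) -> int:
    digest = hmac.new(key, str(phrase_index).encode(), hashlib.sha256).digest()
    start = digest[0] % len(candidates)
    return (candidates.index(chosen) - start) % len(candidates)

key = b"secret"
cands = [0, 3, 6]                                        # e.g., three longest-match positions
chosen = keyed_choice(key, 7, cands, bit=1)
print(chosen, recover_bit(key, 7, cands, chosen))        # the embedded bit (1) is recovered
```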
2004: Error resilient LZ'77 and its analysis. S. Lonardi, W. Szpankowski, M. Ward
ISIT 2004 - IEEE International Symposium on Information Theory, p.56, Chicago, IL, 2004.

2003: Authentication of LZ-77 compressed data. M. J. Atallah, S. Lonardi
SAC 2003 - ACM Symposium on Applied Computing, 282-287, Melbourne, FL, 2003. Proceedings. doi:10.1145/952532.952591. BibTex entry.

Conference version of the CCPE'04 paper.
2003: Joint source-channel LZ'77 coding. S. Lonardi, W. Szpankowski
DCC 2003 - IEEE Data Compression Conference, 273-283, Snowbird, Utah, 2003. Proceedings. doi:10.1109/DCC.2003.1194018. BibTex entry.

Limited memory and bounded communication resources require powerful data compression techniques, but at the same time noisy tetherless channels and/or corrupted file systems call for error-correction capabilities. Joint source-channel coding has emerged as a viable solution to this problem. We present the first practical joint source-channel coding algorithm capable of correcting errors in the popular Lempel-Ziv'77 scheme with virtually no loss of compression power. This is possible because the LZ'77 (as well as gzip) encoder does not completely remove all redundancy. The additional redundancy left by the LZ'77 encoder is used succinctly by a channel coder (e.g., a Reed-Solomon coder) to protect against a limited number of errors. The proposed scheme is also perfectly backward-compatible, that is, a file compressed with error-resilient LZ'77 can still be decompressed by a common LZ'77 decoder. We present algorithms, supporting experimental data, and theoretical justifications.
2000: Off-line compression by greedy textual substitution. A. Apostolico, S. Lonardi
Proceedings of the IEEE, 88(11): 1733-1744, 2000. doi:10.1109/5.892709. BibTex entry.
Greedy off-line textual substitution refers to the following approach to compression or structural inference. Given a long text string x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted text string until substrings capable of producing contractions can no longer be found. This paper examines computational issues arising in the implementation of this paradigm and describes some applications and experiments.
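A hedged sketch of one greedy step: enumerate candidate substrings, estimate the contraction obtained by replacing all non-overlapping occurrences but one with a short pointer, and pick the best. The fixed pointer cost below is an assumed simplification of the actual encoding.

```python
POINTER_COST = 2   # assumed cost, in symbols, of one pointer

def count_nonoverlapping(text, w):
    count = start = 0
    while (pos := text.find(w, start)) != -1:
        count, start = count + 1, pos + len(w)
    return count

def best_substring(text, min_len=2, max_len=20):
    """Find the substring whose greedy replacement contracts the text the most."""
    best, best_gain = None, 0
    seen = set()
    for i in range(len(text)):
        for L in range(min_len, min(max_len, len(text) - i) + 1):
            w = text[i:i + L]
            if w in seen:
                continue
            seen.add(w)
            k = count_nonoverlapping(text, w)
            gain = (k - 1) * (len(w) - POINTER_COST)     # one occurrence stays verbatim
            if k > 1 and gain > best_gain:
                best, best_gain = w, gain
    return best, best_gain

text = "the cat sat on the mat, the cat sat on the hat"
print(best_substring(text))   # expected: "the cat sat on the " (or a close variant)
```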
2000: Compression of biological sequences by greedy off-line textual substitution. A. Apostolico, S. Lonardi
DCC 2000 - IEEE Data Compression Conference, 143-152, Snowbird, Utah, 2000. Proceedings. doi:10.1109/DCC.2000.838154. BibTex entry.
We follow one of the simplest possible steepest descent paradigms. This consists of performing repeated stages in each one of which we identify a substring of the current version of the text yielding the maximum compression, and then replace all those occurrences except one with a pair of pointers to the untouched occurrence. This is somewhat dual with respect to the bottom-up vocabulary-buildup scheme considered by Rubin. This simple scheme already poses some interesting algorithmic problems. In terms of performance, the method does outperform current Lempel-Ziv implementations in most cases. Here we show that, on biological sequences, it beats all other generic compression methods and approaches the performance of methods specifically built around some peculiar regularities of DNA sequences, such as tandem repeats and palindromes, that are neither distinguished nor treated selectively here. The most interesting performances, however, are obtained in the compression of entire groups of genetic sequences forming families with similar characteristics. This is becoming a standard and useful way to group sequences in a growing number of important specialized databases. On such inputs, the approach presented here yields scores that are not only better than those of any other method, but also improve increasingly with increasing input size. This is to be attributed to a certain ability to capture distant relationships among the sequences in a family.
1999: Fractal image approximation and orthogonal bases. S. Lonardi, P. Sommaruga
Image Communication, 14(5): 413-423, Elsevier, 1999. doi:10.1016/S0923-5965(98)00021-6. BibTex entry.
We are concerned with the fractal approximation of multidimensional functions in L^2. In particular, we treat a position-dependent approximation using orthogonal bases of L^2 and no search. We describe a framework that establishes a connection between the classic orthogonal approximation and the fractal approximation. The main theorem allows easy and univocal computation of the parameters of the approximating function. From the computational perspective, the result avoids solving the ill-conditioned linear systems usually required by earlier fractal approximation techniques. Additionally, orthogonal bases yield the most compact representation of the approximation. We discuss the approximation of gray-scale digital images as a direct application of our approximation scheme.
1999: Off-line data compression by textual substitution. S.Lonardi
Ph.D. Thesis, Dipartimento di Ingegneria Informatica e Elettronica, Università di Padova, Italy, December 1999. BibTex entry.
1998: Some theory and practice of greedy off-line textual substitution. A. Apostolico, S. Lonardi
DCC 1998 - IEEE Data Compression Conference, 119-128, Snowbird, Utah, 1998. Proceedings. doi:10.1109/DCC.1998.672138. BibTex entry.
Greedy off-line textual substitution refers to the following steepest descent approach to compression or structural inference. Given a long text string x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted text string, until substrings capable of producing contractions can no longer be found. This paper examines the computational issues and performance resulting from implementations of this paradigm in preliminary applications and experiments. Apart from intrinsic interest, these methods may find use in the compression of massively disseminated data, and lend themselves to efficient parallel implementation, perhaps on dedicated architectures.

Other

2009: CPM's 20th Anniversary: A Statistical Retrospective. E. Y. Harris, T. Lecroq, G. Kucherov, S. Lonardi
CPM 2009 - Annual Symposium on Combinatorial Pattern Matching, pp. 1-11, Lille, France, 2009. Proceedings, LNCS 5577. doi:10.1007/978-3-642-02441-2_1.

This year the Annual Symposium on Combinatorial Pattern Matching (CPM) celebrates its 20th anniversary. Over the last two decades the Symposium has established itself as the most recognized international forum for research in combinatorial pattern matching and related applications. Contributions to the conference typically address issues of searching and matching strings and more complex patterns such as trees, regular expressions, graphs, point sets, and arrays. Advances in this field rely on the ability to expose combinatorial properties of the computational problem at hand and to exploit these properties in order to either achieve superior performance or identify conditions under which searches cannot be performed efficiently. The meeting also deals with combinatorial problems in computational biology, data compression, data mining, coding, information retrieval, natural language processing and pattern recognition.
2007: Soft-core processor customization using the design of experiments paradigm. D. Shelton, F. Vahid, S. Lonardi
DATE 2007 - Design, Automation and Test in Europe, pp. 821-826, Nice, France, 2007. doi:10.1145/1266366.1266541.

Parameterized components are becoming more commonplace in system design. The process of customizing parameter values for a particular application, called tuning, can be a challenging task for a designer. Here we focus on the problem of tuning a parameterized soft-core microprocessor to achieve the best performance on a particular application, subject to size constraints. We map the tuning problem to a well-established statistical paradigm called Design of Experiments (DoE), which involves designing a carefully selected set of experiments and a sophisticated analysis whose objective is to extract the maximum amount of information about the effects of the input parameters. We apply the DoE method to analyze the relation between input parameters and the performance of a soft-core microprocessor for a particular application, using only a small number of synthesis/execution runs. The information gained by the analysis in turn drives a soft-core tuning heuristic. We show that using DoE to sort the parameters in order of impact results in application speedups of 6x-17x versus an un-tuned base soft-core. When compared to a previous single-factor tuning method, the DoE-based method achieves 3x-6x application speedups, while requiring about the same tuning runtime. We also show that tuning runtime can be reduced by 40-45% by using predictive tuning methods already built into a DoE tool.
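A hedged illustration of the DoE idea of ranking parameters by impact: run a small two-level design and estimate each parameter's main effect on performance. The full-factorial design, parameter names, and synthetic performance model below are illustrative; the paper relies on a DoE tool and real synthesis/execution runs.

```python
from itertools import product

def performance(cfg):
    """Stand-in for a synthesis/execution run; cfg maps parameter -> 0 (low) or 1 (high)."""
    return 100 - 40 * cfg["cache"] - 10 * cfg["multiplier"] - 2 * cfg["uart"]

PARAMS = ["cache", "multiplier", "uart"]   # hypothetical soft-core parameters

def main_effects(params, run):
    """Estimate each parameter's main effect from a two-level design and rank by impact."""
    runs = [dict(zip(params, levels)) for levels in product([0, 1], repeat=len(params))]
    results = [(cfg, run(cfg)) for cfg in runs]
    effects = {}
    for p in params:
        hi = [y for cfg, y in results if cfg[p] == 1]
        lo = [y for cfg, y in results if cfg[p] == 0]
        effects[p] = sum(hi) / len(hi) - sum(lo) / len(lo)
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)

print(main_effects(PARAMS, performance))   # cache should dominate, then multiplier, then uart
```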
2007: Two level microprocessor-accelerator partitioning. S. Sirowy, Y. Wu, S. Lonardi, F. Vahid
DATE 2007 - Design, Automation and Test in Europe, pp. 313-318, Nice, France, 2007. doi:10.1145/1266366.1266433.

The integration of microprocessors and field-programmable gate array (FPGA) fabric on a single chip increases both the utility and necessity of tools that automatically move software functions from the microprocessor to accelerators on the FPGA to improve performance or energy. Such hardware/software partitioning for modern FPGAs involves the problem of partitioning functions among two levels of accelerator groups: tightly-coupled accelerators that have fast single-clock-cycle memory access to the microprocessor's memory, and loosely-coupled accelerators that access memory through a bridge to avoid slowing the main clock period with their longer critical paths. We introduce this new two-level accelerator-partitioning problem, and we describe a novel optimal dynamic programming algorithm to solve it. By exploiting the size constraint imposed by FPGAs, the algorithm has what is effectively quadratic runtime complexity, running in just a few seconds for examples with up to 25 accelerators and obtaining an average performance improvement of 35% compared to a traditional single-level bus architecture.
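A hedged, simplified sketch of two-level partitioning as a dynamic program over the two area budgets: each function stays on the microprocessor or moves to a tightly- or loosely-coupled accelerator, and total time is minimized subject to the two size constraints. The numbers and the formulation are illustrative, not the paper's algorithm.

```python
import functools

# (cpu_time, tight_time, tight_area, loose_time, loose_area) per function; made-up numbers
FUNCS = [(50, 10, 3, 15, 2), (40, 12, 2, 20, 1), (30, 8, 4, 9, 3), (25, 20, 1, 6, 2)]

def partition(funcs, tight_budget, loose_budget):
    @functools.lru_cache(maxsize=None)
    def dp(i, ta, la):
        if i == len(funcs):
            return 0
        cpu_t, tt, tar, lt, lar = funcs[i]
        best = cpu_t + dp(i + 1, ta, la)                       # keep on the microprocessor
        if tar <= ta:
            best = min(best, tt + dp(i + 1, ta - tar, la))     # tightly-coupled accelerator
        if lar <= la:
            best = min(best, lt + dp(i + 1, ta, la - lar))     # loosely-coupled accelerator
        return best
    return dp(0, tight_budget, loose_budget)

print(partition(FUNCS, tight_budget=5, loose_budget=4))
```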
2007: Clock-frequency assignment for multiple clock domain systems-on-a-chip. S. Sirowy, Y. Wu, S. Lonardi, F. Vahid
DATE 2007 - Design, Automation and Test in Europe, pp. 397-402, Nice, France, 2007. doi:10.1145/1266366.1266450.

Modern systems-on-a-chip platforms support multiple clock domains, in which different sub-circuits are driven by different clock signals. Although the frequency of each domain can be customized, the number of unique clock frequencies on a platform is typically limited. We define the clock-frequency assignment problem to be the assignment of frequencies to processing modules, each with an ideal maximum frequency, such that the sum of module processing times is minimized, subject to a limit on the number of unique frequencies. We develop a novel polynomial-time optimal algorithm to solve the problem, based on dynamic programming. We apply the algorithm to the particular context of post-improvement of accelerator-based hardware/software partitioning, and demonstrate 1.5x-4x additional speedups using just three clock domains.
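A hedged sketch of the problem as an interval dynamic program: if the modules are sorted by ideal maximum frequency, a group sharing one clock must run at the smallest maximum in the group, so an optimal assignment uses contiguous groups of the sorted list. The formulation and numbers below are illustrative and may differ in detail from the paper's algorithm.

```python
import functools

# (cycles, max_frequency_MHz) per module; made-up numbers
MODULES = [(1000, 50), (400, 80), (900, 120), (300, 200), (700, 220)]

def assign(modules, k):
    """Minimum total processing time using at most k unique clock frequencies."""
    mods = sorted(modules, key=lambda m: m[1])          # ascending ideal maximum frequency
    n = len(mods)

    def group_time(i, j):                               # modules i..j-1 share one clock
        freq = mods[i][1]                               # limited by the slowest module in the group
        return sum(c for c, _ in mods[i:j]) / freq

    @functools.lru_cache(maxsize=None)
    def dp(i, clocks_left):
        if i == n:
            return 0.0
        if clocks_left == 0:
            return float("inf")
        return min(group_time(i, j) + dp(j, clocks_left - 1) for j in range(i + 1, n + 1))

    return dp(0, k)

print(assign(MODULES, k=3), assign(MODULES, k=1))   # more clock domains => lower total time
```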
2004: On average sequence complexity. S. Janson, S. Lonardi, W. Szpankowski
Theoretical Computer Science, 326(1-3): 213-227, 2004. doi:10.1016/j.tcs.2004.06.023. BibTex entry.

In this paper we study the average behavior of the number of distinct substrings in a text of size n over an alphabet of cardinality k. This quantity is called the complexity index and it captures the richness of the language used in a sequence. For example, sequences with low complexity index contain a large number of repeated substrings and they eventually become periodic (e.g., tandem repeats in a DNA sequence). In order to identify unusually low- or high-complexity strings one needs to determine how far the complexities of the strings under study are from the average or maximum string complexity. While the maximum string complexity was studied quite extensively in the past, to the best of our knowledge there are no results concerning the average complexity. We first prove that for a sequence generated by a mixing model (which includes Markov sources) the average complexity is asymptotically equal to n^2/2, which coincides with the maximum string complexity. However, for a memoryless source we establish a more precise result.
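A quick illustration of the complexity index: counting distinct non-empty substrings directly shows that a random-looking string approaches the maximum n(n+1)/2 while a periodic string (e.g., a tandem repeat) stays far below it. The snippet is purely illustrative and uses a brute-force count.

```python
import random

def complexity_index(s):
    """Number of distinct non-empty substrings of s (brute force)."""
    return len({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)})

n = 60
periodic = "ACG" * (n // 3)                                   # low complexity: tandem repeat
random_str = "".join(random.choice("ACGT") for _ in range(n))
print(n * (n + 1) // 2,                                       # maximum possible complexity
      complexity_index(random_str),                           # close to the maximum
      complexity_index(periodic))                             # far below it
```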
2004: On average sequence complexity. S. Janson, S. Lonardi, W. Szpankowski
CPM 2004 - Symposium on Combinatorial Pattern Matching, 74-88, Istanbul, Turkey, 2004. Proceedings. LNCS 3109. doi:10.1007/b98377. BibTex entry.

Conference version of the TCS'04 paper above.
2002: A speed-up for the commute between subword trees and DAWGs. A. Apostolico, S. Lonardi
Information Processing Letters, 83(3): 159-161, 2002. doi:10.1016/S0020-0190(01)00327-1. BibTex entry.

A popular way to describe and build the DAWG or Directed Acyclic Word Graph of a string is by transformation of the corresponding subword tree. This transformation, which is not difficult to reverse, is easy to grasp and almost trivial to implement except for the assumed implication of a standard tree isomorphism algorithm. Here we point out a simple property of subword trees that makes checking tree isomorphism in this context a straightforward process, thereby simplifying the transformation significantly. Subword trees and DAWGs arise rather ubiquitously in applications of string processing, where they often play complementary roles. Efficient conversions are thus especially desirable.
Note: After publishing the manuscript it was brought to our attention that the main result was already published in Dan Gusfield's book.
1994: Encoding pyramids by labeling RAAM. S. Lonardi, A. Sperduti, A. Starita
NNSP 1994 - IEEE Workshop on Neural Networks for Signal Processing, 651-660, Ermioni, Greece, 1994. Proceedings. doi:10.1109/NNSP.1994.366000. BibTex entry.
In this paper we present preliminary results on the application of Labeling RAAM (LRAAM) to encoding pyramids. The LRAAM is a neural network able to develop reduced representations of labeled directed graphs. The capability of such representations to preserve structural similarities is demonstrated on a pyramid. We suggest exploiting this capability for data compression and/or for discovering scale-affine self-similarities.


Copyright Notice

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Papers published by the Association for Computing Machinery are Copyright © by the Association for Computing Machinery, Inc. Papers published by the IEEE are Copyright by the IEEE Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

 


Address



University of California, Riverside
Multidisciplinary Research Building, room 3130
Riverside, CA 92521
Tel: (951) 827-2203
Fax: (951) 827-4643

 
