Anna Charisi


PhD Student in the Genetics, Genomics and Bioinformatics Program
Department of Computer Science & Engineering
University of California Riverside
354 Engineering Building Unit 2. Room 363
Telephone: +1 951 787 2838
Email: acharisi@cs.ucr.edu

Research Interests:

Bioinformatics, Computational Biology, Data Mining, Databases, Signal Processing, Image Processing, Medical Image Processing, Pattern Recognition, Machine Learning Techniques, Web Applications and Interfaces, Electronic Government, Virtual Reality


Project Description for CS 234

Implementation of an R framework that predicts physicochemical molecular features of compounds. The compounds are described in an sdf (structure data) file. The framework will provide the user with functions to parse sdf files and to predict molecular physical descriptors for a compound, such as the number of specific atoms, the molecular weight, the acid-base ionization (or dissociation) constant (pKa), and the octanol-water partition coefficient (logP). These descriptors could then be used by QSAR (Quantitative Structure-Activity Relationship) tools, including structural similarity search in chemical compound databases.


Progress Reports for CS 234

10/21/2006 - 11/03/2006:
Computation of the number of specific atoms in a molecule, of the implicit hydrogen atoms (these atoms are not listed in the .sdf file and have to be added), and of the molecular weight. Predicting the implicit hydrogen atoms is not a trivial task. The two most widely used open-source libraries in chemoinformatics (JoeLib and Open Babel) compute the implicit hydrogen atoms with an error rate of around 5% (measured on a dataset of 2,100 compounds). These libraries try to identify specific SMARTS patterns of the atoms in the molecule. My approach is much simpler: it relies on basic chemistry rules (the valence of the atoms and their charge) and achieves a 0% error rate on the same dataset of 2,100 compounds. The consistency of the method still has to be verified on other compound datasets. The difficulty is that it is not easy to find a dataset of compounds with vendor-supplied, precalculated molecular weights, which is needed as a reference.
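
The valence-and-charge idea described above can be illustrated with a short sketch. This is not the R implementation from the project. It is a minimal Python illustration under stated assumptions: each element has a single typical valence, bond orders are given in Kekulé form (1, 2 or 3), a charge raises or lowers the allowed valence by a simple rule, and elements outside the small table receive no implicit hydrogens. The function names (allowed_valence, implicit_hydrogens, atom_counts, molecular_weight) are hypothetical.

```python
from collections import Counter

# Assumption: one typical valence per element; real chemistry has more cases.
TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "P": 3,
                   "F": 1, "Cl": 1, "Br": 1, "I": 1}

# Standard atomic weights (rounded) for the elements handled here.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                 "S": 32.06, "P": 30.974, "F": 18.998, "Cl": 35.45,
                 "Br": 79.904, "I": 126.904}


def allowed_valence(symbol, charge):
    """Valence available to an atom after accounting for its formal charge."""
    if symbol not in TYPICAL_VALENCE:
        return 0  # assumption: unknown elements get no implicit hydrogens
    base = TYPICAL_VALENCE[symbol]
    if symbol == "C":
        # Both carbocations and carbanions form one bond fewer.
        return base - abs(charge)
    # For N, O, S, P and halogens: +1 adds a bond (NH4+), -1 removes one (RO-).
    return base + charge


def implicit_hydrogens(atoms, bonds):
    """atoms: list of (symbol, charge); bonds: list of (i, j, order), 0-based.

    Returns the number of implicit hydrogens for each atom.
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [max(0, allowed_valence(symbol, charge) - used[k])
            for k, (symbol, charge) in enumerate(atoms)]


def atom_counts(atoms, bonds):
    """Count atoms per element, including the implicit hydrogens."""
    counts = Counter(symbol for symbol, _ in atoms)
    extra_h = sum(implicit_hydrogens(atoms, bonds))
    if extra_h:
        counts["H"] += extra_h
    return counts


def molecular_weight(atoms, bonds):
    """Sum of atomic weights over all explicit and implicit atoms."""
    return sum(ATOMIC_WEIGHT[symbol] * n
               for symbol, n in atom_counts(atoms, bonds).items())


# Ethanol as it would appear without hydrogens: C-C-O
ethanol_atoms = [("C", 0), ("C", 0), ("O", 0)]
ethanol_bonds = [(0, 1, 1), (1, 2, 1)]
print(dict(atom_counts(ethanol_atoms, ethanol_bonds)))
print(round(molecular_weight(ethanol_atoms, ethanol_bonds), 3))
```

For ethanol the sketch adds three, two and one hydrogens to the three heavy atoms, giving the formula C2H6O. Aromatic bond types and elements with several common valences would need extra rules that this sketch does not attempt.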

10/06/2006 - 10/20/2006:
Implementation of a parser for sdf (structure data file) files in the R language, and creation of an object-oriented model that represents a compound. An sdf file contains the following blocks:

  • Header block: includes the name of the molecule (compound), information about the program that created the sdf file, and a count line that indicates how many atoms and bonds the compound has.
  • Atom block: a set of lines, each corresponding to a specific atom of the compound, that specify the 3D coordinates, the symbol, the charge and other information about that atom.
  • Bond block: a set of lines, each corresponding to a specific bond between two atoms, that specify which atoms are connected by that bond, the type of the bond and other information.
  • Data block: Pairs of tagged data that provide additional information about the compound.
The parser takes an sdf file as input and creates an R list of compound objects, each one corresponding to a compound parsed from the input file. The object model consists of the following classes: Atom, AtomList, Bond, BondList, ExtraDataItem, ExtraDataList, Compound, CompoundList, and Parser. Methods are implemented for parsing all the compounds of the .sdf file, or for parsing only specific compounds selected by the user through the internal id of the compound (a counter of the compounds inside the sdf file). The user is also provided with methods to get the atom sequence of the compound, the coordinates of the atoms, or the information about the bonds. Due to memory limitations, the parser can process sdf files containing up to 25,000 compounds.
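
The block layout described above can be sketched as a small parser. This is a Python illustration, not the project's R classes. It assumes the fixed-column V2000 layout of the atom, bond and count lines, data headers of the form "> <TAG>", and records terminated by a "$$$$" line. The names parse_record and parse_sdf, and the dictionary keys, are hypothetical.

```python
import re

# V2000 charge codes in the atom block; 0 means uncharged.
# Assumption: code 4 (doublet radical) is treated as uncharged here.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 5: -1, 6: -2, 7: -3}


def parse_record(lines):
    """Parse the lines of one compound (without the '$$$$' terminator)."""
    name = lines[0].strip()
    # Count line: atom and bond counts sit in two 3-character fields.
    n_atoms = int(lines[3][0:3])
    n_bonds = int(lines[3][3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        atoms.append({
            "xyz": (float(line[0:10]), float(line[10:20]), float(line[20:30])),
            "symbol": line[31:34].strip(),
            "charge": CHARGE_CODES.get(int(line[36:39].strip() or 0), 0),
        })

    # Bond endpoints keep the 1-based atom numbering used in the file.
    bond_start = 4 + n_atoms
    bonds = [(int(line[0:3]), int(line[3:6]), int(line[6:9]))
             for line in lines[bond_start:bond_start + n_bonds]]

    # Data block: '> <TAG>' header, value lines, then a blank line.
    data, tag = {}, None
    for line in lines[bond_start + n_bonds:]:
        if line.startswith(">"):
            match = re.search(r"<([^>]+)>", line)
            tag = match.group(1) if match else None
            if tag is not None:
                data[tag] = []
        elif not line.strip():
            tag = None
        elif tag is not None:
            data[tag].append(line)
    data = {key: "\n".join(value) for key, value in data.items()}

    return {"name": name, "atoms": atoms, "bonds": bonds, "data": data}


def parse_sdf(text, wanted=None):
    """Parse all compounds, or only those whose 1-based position is in `wanted`.

    A trailing record without a '$$$$' terminator is ignored.
    """
    compounds, record, counter = [], [], 0
    for line in text.splitlines():
        if line.strip() != "$$$$":
            record.append(line)
            continue
        counter += 1
        if wanted is None or counter in wanted:
            compounds.append(parse_record(record))
        record = []
    return compounds
```

Calling parse_sdf(text) returns every compound, while parse_sdf(text, wanted={2}) returns only the second one, which mirrors selection by the internal compound counter. Because the records are scanned one at a time, the same loop could be turned into a generator to reduce memory use on large files.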