Bioinformatics, Computational Biology, Data Mining, Databases, Signal Processing, Image Processing, Medical Image Processing,
Pattern Recognition, Machine Learning Techniques, Web Applications and Interfaces,
Electronic Government, Virtual Reality
Project Description for CS 234
Implementation of an R framework that predicts physicochemical molecular features of compounds. The compounds are described
in an sdf (structure data) file. The framework will provide the user with functions to parse sdf files and
predict molecular physical descriptors for a compound, such as the number of specific atoms, the molecular weight, the
acid-base ionization (or dissociation) constant (pKa), and the octanol-water partition coefficient (logP). These
descriptors could then be used in QSAR (Quantitative Structure-Activity Relationship) tools, including structural
similarity search in chemical compound databases.
Progress Reports for CS 234
10/21/2006 – 11/03/2006:
Computation of the number of specific atoms of a molecule. Computation of the implicit hydrogen atoms of the molecule (these atoms
are not described in the .sdf file and have to be added). Computation of the molecular weight.
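As an illustration of the atom-count and molecular-weight steps, here is a minimal sketch. The project itself is written in R; Python is used here only for illustration, and the function names (`count_atoms`, `molecular_weight`) and the small mass table are assumptions, not part of the framework.

```python
from collections import Counter

# Rounded standard atomic masses (g/mol) for a small illustrative subset of elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06, "Cl": 35.45}


def count_atoms(symbols, implicit_h=0):
    """Count atoms per element; implicit hydrogens are added to the H count."""
    counts = Counter(symbols)
    if implicit_h:
        counts["H"] += implicit_h
    return counts


def molecular_weight(counts):
    """Sum atomic mass times atom count over all elements."""
    return sum(ATOMIC_MASS[element] * n for element, n in counts.items())


# Ethanol: the heavy atoms C, C, O as listed in an atom block, plus 6 implicit hydrogens.
counts = count_atoms(["C", "C", "O"], implicit_h=6)
mw = molecular_weight(counts)
print(dict(counts), round(mw, 3))
```

The weight is only correct once the implicit hydrogens have been added, which is why the hydrogen count is passed in explicitly.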
Predicting the implicit hydrogen atoms is not a trivial task. The two most widely used open-source libraries in chemoinformatics
(JOELib and Open Babel) compute the implicit hydrogen atoms with an error rate of around 5% (measured on a dataset of 2,100 compounds).
These methods try to identify specific SMARTS patterns of the atoms in the molecule. My approach is much simpler: it
considers only simple chemistry rules (the valence of the atoms and the charge), and it achieves a 0% error rate
on the same dataset of 2,100 compounds. The consistency of the method still has to be verified on other compound datasets.
The difficulty is that it is not easy to find a dataset of compounds with precalculated molecular weights from the vendor,
which is needed as a reference dataset.
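A valence-and-charge rule of the kind described above could look roughly like the following sketch. It is written in Python for illustration and is not the project's R code. The valence table, the charge adjustment, and the function names are assumptions chosen for the example. Aromatic bond types are not handled.

```python
# Typical valences of a few neutral atoms (illustrative subset, an assumption of this sketch).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def effective_valence(symbol, charge):
    """Adjust the typical valence for the formal charge of the atom."""
    base = DEFAULT_VALENCE[symbol]
    if symbol == "C":
        # Carbocations and carbanions both form three bonds.
        return base - abs(charge)
    # For N, O and halogens: N+ forms four bonds, O- forms one, and so on.
    return base + charge


def implicit_hydrogens(atoms, bonds):
    """atoms: list of (symbol, charge); bonds: list of (i, j, order), 0-based indices.

    Returns the number of implicit hydrogens per atom, in atom order.
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    result = []
    for (symbol, charge), bond_sum in zip(atoms, used):
        if symbol not in DEFAULT_VALENCE:
            result.append(0)  # unknown elements and explicit H get no implicit hydrogens
        else:
            result.append(max(0, effective_valence(symbol, charge) - bond_sum))
    return result


# Ethanol (C-C-O) and the acetate anion (CH3-C(=O)-O-).
ethanol = implicit_hydrogens([("C", 0), ("C", 0), ("O", 0)], [(0, 1, 1), (1, 2, 1)])
acetate = implicit_hydrogens(
    [("C", 0), ("C", 0), ("O", 0), ("O", -1)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)
print(ethanol, acetate)
```

Each atom gets its charge-adjusted valence minus the sum of its bond orders, so no substructure pattern matching is needed.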
10/6/2006 – 10/20/2006:
Implementation of a parser for sdf (structure data file) files in the R language and creation of an object-oriented
model that represents a compound. An sdf file contains the following blocks:
- Header block: includes the name of the molecule (compound), information about the program that created the sdf file, and a count line that indicates how many atoms and bonds the compound has.
- Atom block: a set of lines, each corresponding to a specific atom of the compound, that specify the 3D coordinates, the symbol, the charge, and other information about the atom.
- Bond block: a set of lines, each corresponding to a specific bond between two atoms, that specify which atoms are connected by that bond, the type of the bond, and other information.
- Data block: pairs of tagged data that provide additional information about the compound.
The parser takes an sdf file as input and creates an R list of compound objects,
each one corresponding to a compound parsed from the input file. The object model consists of
the following classes: Atom, AtomList, Bond, BondList, ExtraDataItem, ExtraDataList, Compound, CompoundList,
and Parser. Methods are implemented for parsing all the compounds of the .sdf file or for parsing only specific
compounds selected by the user through the internal id of the compound (a counter of the compounds inside the sdf file).
The user is also provided with methods to get the atom sequence of the compound, the coordinates of the atoms, or the
information about the bonds. Due to memory limitations, the parser can process sdf files containing up
to 25,000 compounds.
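The block layout and the parsing by internal id described in this report can be sketched as follows. This is a Python illustration, not the project's R classes. It assumes the fixed-width V2000 connection-table layout, uses plain dictionaries in place of the Atom, Bond, and Compound objects, and ignores property lines such as `M  CHG`. The sample record, its program line, and the `VENDOR_ID` tag are made up for the example.

```python
# Charge codes of the V2000 atom line mapped to formal charges (4 is a doublet radical).
CHARGE_CODE = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}


def parse_record(lines):
    """Parse one compound record (the lines before its '$$$$' separator)."""
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])  # count line
    atoms = []
    for line in lines[4:4 + n_atoms]:
        atoms.append({
            "xyz": (float(line[0:10]), float(line[10:20]), float(line[20:30])),
            "symbol": line[31:34].strip(),
            "charge": CHARGE_CODE[int(line[36:39].strip() or 0)],
        })
    start = 4 + n_atoms
    bonds = [(int(l[0:3]), int(l[3:6]), int(l[6:9])) for l in lines[start:start + n_bonds]]
    data, tag = {}, None
    for line in lines[start + n_bonds:]:
        if line.startswith(">") and "<" in line:
            tag = line[line.index("<") + 1:line.rindex(">")]  # text between < and >
            data[tag] = []
        elif not line.strip():
            tag = None  # a blank line ends the current data item
        elif tag is not None:
            data[tag].append(line.strip())
    return {"name": lines[0].strip(), "program": lines[1].strip(),
            "atoms": atoms, "bonds": bonds, "data": data}


def parse_sdf(text, wanted_ids=None):
    """Parse all compounds, or only those whose 1-based position is in wanted_ids."""
    compounds, record, counter = [], [], 0
    for line in text.splitlines():
        if line.strip() == "$$$$":
            counter += 1
            if wanted_ids is None or counter in wanted_ids:
                compound = parse_record(record)
                compound["id"] = counter
                compounds.append(compound)
            record = []
        else:
            record.append(line)
    return compounds


# A made-up ethanol record (heavy atoms only) used twice to form a two-compound file.
SAMPLE = "\n".join([
    "ethanol",
    "  hypothetical-program",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 C   0  0",
    "    2.2000    1.2000    0.0000 O   0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
    "> <VENDOR_ID>",
    "X-0001",
    "",
    "$$$$",
])

all_compounds = parse_sdf(SAMPLE + "\n" + SAMPLE)
only_second = parse_sdf(SAMPLE + "\n" + SAMPLE, wanted_ids={2})
print(len(all_compounds), [a["symbol"] for a in only_second[0]["atoms"]])
```

Skipping unwanted records before parsing them is one simple way to limit memory use on large files. Whether the R parser does this is not stated in the report.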