Recall that mRNA suitable for translation by a ribosome is made up of exons that have been spliced together after transcription. For introns to be spliced out, cell mechanisms must recognize the sites in the transcript where splicing is to take place. To get more details you may consult this URL. Splice sites at the start of introns are called "donor sites" and splice sites at the end of introns are called "acceptor sites". Nearly all introns start with GT and end with AG, but since each of these sequences is so short, they also appear elsewhere in the DNA. Consequently, this "GT-start and AG-end rule" is not sufficient to characterize an intron.

Your job will be to develop software that will be trained to recognize the splice boundaries. It will then be tested on other DNA segments so that you can evaluate the success of the training.

Data to be Used

This file contains a list of 570 vertebrate genes with the exons identified. Here is a sample of the information for a typical gene:

AGGGLINE 3066 3157 3281 3503 4393 4521

The first few characters represent the name of the sequence.

This is followed by a set of pairs, each pair identifying an exon. In this case, we have three exons covering the following ranges: [3066, 3157], [3281, 3503], and [4393, 4521].
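As an illustration, an annotation line of this form could be parsed as follows. This is a minimal sketch; the function name `parse_exon_line` is hypothetical, and it assumes fields are separated by whitespace as in the sample above.

```python
def parse_exon_line(line):
    """Split an annotation line into (name, [(start, end), ...])."""
    fields = line.split()
    name = fields[0]
    numbers = [int(x) for x in fields[1:]]
    if len(numbers) % 2 != 0:
        raise ValueError(f"odd number of coordinates for {name}")
    # Pair up consecutive numbers: (1st, 2nd), (3rd, 4th), ...
    exons = list(zip(numbers[0::2], numbers[1::2]))
    return name, exons


name, exons = parse_exon_line("AGGGLINE 3066 3157 3281 3503 4393 4521")
print(name, exons)
# AGGGLINE [(3066, 3157), (3281, 3503), (4393, 4521)]
```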

The very long file at this URL contains the DNA sequences for each of the 570 genes (both introns and exons).

For example, it contains the DNA sequence for AGGGLINE. Here are a few of the first lines for this gene:

>AGGGLINE
CAAGGCTGCTGTCACTAGCCTGTGGGGCAAGGTGAATGTGGAAGATGCTGGGGGAGAAAC
CCTGGGAAGGTAGGCTCTGGTGACCAGGACGAGGGAGGGAAGGAAGGAACCTATGCTTGG
CAAAAGTTCAGGCTGCCTCTCAGGATTTGTGGCACCTTCTGACTTTCAAACTGCTATTGT


The genes are given in FASTA format.

You should assume that the first nucleotide (a C in this case) is indexed as position 1 (not position 0).
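The FASTA layout and the 1-based indexing convention can be handled with a small reader and an accessor. The sketch below is only one possible approach; `read_fasta` and `subseq` are hypothetical names, and the reader assumes each header line has the form `>NAME` with the sequence on the lines that follow.

```python
def read_fasta(lines):
    """Return {name: sequence} from an iterable of FASTA lines."""
    records, name, chunks = {}, None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def subseq(seq, start, end):
    """Inclusive, 1-indexed slice: position 1 corresponds to seq[0]."""
    return seq[start - 1:end]


sample = [
    ">AGGGLINE",
    "CAAGGCTGCTGTCACTAGCCTGTGGGGCAAGGTGAATGTGGAAGATGCTGGGGGAGAAAC",
]
genes = read_fasta(sample)
print(subseq(genes["AGGGLINE"], 1, 4))  # CAAG
```

Keeping the index conversion inside a single accessor avoids scattering `- 1` adjustments through the rest of the program.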

Also, when you encounter an exon start position that is larger than an exon end position you should assume that the exon is actually on the "other" DNA strand. In other words, it should be read in the reverse order and in a complemented fashion.
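A reverse-strand exon can be recovered by taking the reverse complement of the corresponding forward-strand segment. The sketch below assumes that both coordinates still refer to 1-indexed positions in the sequence as given in the file; the name `exon_sequence` is hypothetical.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def exon_sequence(seq, start, end):
    """Return the exon as read 5' to 3', using 1-indexed inclusive coordinates."""
    if start <= end:
        return seq[start - 1:end]
    # start > end: the exon lies on the other strand, so take the
    # forward segment [end, start], complement it, and reverse it.
    forward = seq[end - 1:start]
    return forward.translate(COMPLEMENT)[::-1]


print(exon_sequence("AACGT", 2, 4))  # ACG
print(exon_sequence("AACGT", 4, 2))  # CGT
```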

Design a hidden Markov model (HMM) architecture that you believe would be suitable for this task. Train it on the last 470 genes of the data set. Use the first 100 genes as a test set to evaluate how well your model performs.
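Whatever topology you choose, labeling a test gene amounts to finding the most probable state path through the trained model. The following is a minimal sketch of the standard Viterbi algorithm for a first-order HMM, computed in log space to avoid underflow on long genes. The two-state model and all of its probabilities are made up purely for illustration; a real design would need more states (for example, donor and acceptor states) and parameters estimated from the training genes.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most probable state path for a first-order HMM (log space)."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t-1][s]: predecessor of s on the best path at time t
    for symbol in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Toy parameters (assumptions, not estimates from the data set).
states = ["e", "i"]
log_start = logs({"e": 0.5, "i": 0.5})
log_trans = {"e": logs({"e": 0.9, "i": 0.1}), "i": logs({"e": 0.1, "i": 0.9})}
log_emit = {
    "e": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
    "i": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
}
print("".join(viterbi("GGCCAATT", states, log_start, log_trans, log_emit)))
# eeeeiiii
```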

The program should accept a test gene and attempt to label each nucleotide in the gene as either "e" for exon or "i" for intron, with two exceptions: a GT pair that is a legitimate splice start should be labeled "dd" for donor, and an AG pair that is a legitimate splice end should be labeled "aa" for acceptor.
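For training and for scoring, it helps to derive the "true" label string of a gene from its exon list. The sketch below is one way to do this under several stated assumptions: exons are on the forward strand and sorted by position, everything outside an exon (including the flanking regions before the first exon and after the last) is labeled "i", and a donor or acceptor label is applied only when the intron actually begins with GT or ends with AG. The name `label_gene` is hypothetical.

```python
def label_gene(seq, exons):
    """Build the e/i/d/a label string from 1-indexed inclusive exon ranges."""
    labels = ["i"] * len(seq)
    for start, end in exons:
        for pos in range(start, end + 1):
            labels[pos - 1] = "e"
    # Each intron lies between consecutive exons, at 1-indexed
    # positions prev_end + 1 through next_start - 1.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if seq[prev_end:prev_end + 2] == "GT":
            labels[prev_end:prev_end + 2] = "dd"
        if seq[next_start - 3:next_start - 1] == "AG":
            labels[next_start - 3:next_start - 1] = "aa"
    return "".join(labels)


# Toy gene: exon CCATG, intron GTCCAG, exon TTGA.
print(label_gene("CCATGGTCCAGTTGA", [(1, 5), (12, 15)]))
# eeeeeddiiaaeeee
```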

You should also design (with reasonable justification) the procedure for evaluating the success of your HMM on any particular gene. Note that this is somewhat complicated since it may have to recognize several exons in one gene. What do you think would be the best way to evaluate its success across all such exons?
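As a starting point only, one common nucleotide-level measure compares the predicted label string with the true one and reports how many true exon positions were found (sensitivity) and how many predicted exon positions were correct (precision). The sketch below illustrates that idea; it is not the required evaluation procedure, the name `nucleotide_scores` is hypothetical, and you may well prefer an exon-level or splice-site-level measure.

```python
def nucleotide_scores(true_labels, predicted_labels, positive="e"):
    """Return (sensitivity, precision) for one label, position by position."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")
    tp = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
    # Report 0.0 when a ratio is undefined (no positives at all).
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


print(nucleotide_scores("eeiiee", "eeeiie"))  # (0.75, 0.75)
```

The same function can be called with `positive="d"` or `positive="a"` to score splice-site predictions separately from exon coverage.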

Recalling the mutual information studies discussed earlier in class, you might want to consider a second-order Markov process instead of a first-order process.
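In a second-order process, the probability of each nucleotide is conditioned on the two nucleotides that precede it. A minimal sketch of estimating such probabilities from training sequences is shown below; the function name `second_order_probs` and the add-one pseudocount are assumptions chosen for illustration, and in an HMM you would typically estimate a separate table for each state.

```python
from itertools import product


def second_order_probs(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(x[t] | x[t-2], x[t-1]) with pseudocounts."""
    # One row of counts per two-letter context, pre-filled with pseudocounts
    # so that unseen combinations do not get probability zero.
    counts = {
        "".join(context): {s: pseudocount for s in alphabet}
        for context in product(alphabet, repeat=2)
    }
    for seq in sequences:
        for i in range(2, len(seq)):
            context, symbol = seq[i - 2:i], seq[i]
            # Skip anything outside the alphabet (for example, N).
            if context in counts and symbol in counts[context]:
                counts[context][symbol] += 1
    probs = {}
    for context, row in counts.items():
        total = sum(row.values())
        probs[context] = {s: n / total for s, n in row.items()}
    return probs


probs = second_order_probs(["AAAA"])
print(probs["AA"]["A"])  # 0.5
```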

Your report should include the following:

  1. A description of your HMM topology (that is, its architecture).
  2. A reasonably persuasive argument as to why this topology should work.
  3. The procedure you used for training the HMM.
  4. How you measured the success of the HMM.

For extra information about HMMs see Hughey and Krogh. This URL also includes a short description of model estimation.

(courtesy of Forbes Burkowski, University of Waterloo)