Stefano Lonardi

Assistant Professor, Computer Science & Engineering

University of California, Riverside, CA 92521

Last updated: Sept 25, 2001

GEN 240B, Spring 2002: Sequence Comparisons and Genomics Databases by Close, Jiang, Lonardi, Swanson (due June 3rd, 2002)

We will perform similarity searches using BLAST and FASTA as hands-on tools. We will also briefly discuss the Smith-Waterman algorithm as a potential similarity search tool.

FASTA Protein Similarity Search

  • Connect to the Fasta3 search engine at the European Bioinformatics Institute. You may check out the links to the "Help" and "Tool" screens.
  • Your e-mail address and a search title for the sequence are optional entries. Enter "Murine IL-7 Receptor" as your search title. Choose "interactive" as the option for results, although you can also receive them by e-mail.
  • Change the scoring matrix to Blosum62 since this matrix has been shown to detect most protein similarities when the query sequence is long. (The murine IL-7 receptor is 459 amino acids long.)
  • To limit the number of hits (similar sequences) in this search, change the number of scores to 30 and the number of alignments to 10. You may get a histogram of the results by changing the "HIST" drop-down menu to "yes". Leave the other parameters at their default values. We will search the default database, "swall", which is the Swiss-Prot non-redundant database combined with Trembl and TremblNew (Trembl = Translated EMBL and TremblNew = New sequences in Trembl).
  • Copy and paste the murine IL-7 receptor (IL-7R) sequence from this text file. The input sequence can be in any format.
  • Click the "Run Fasta3" button for the search results.
  • You may view the same results as a graphical output by clicking on the "VisualFasta" button from the "Results of Search" screen.
  • Interpretation of the results:
  • In general, one selects sequence similarities with E() value < 0.02 as statistically significant matches.
  • As expected, notice that the murine IL-7R sequences in the database (Accession numbers Q9R0C1, P16872) show the best similarities to the query sequence (with the highest opt and z-scores). Only two other sequences corresponding to the human IL-7R gene (Acc. #'s P16871, Q9UPC1) show fairly high opt and z-scores. Even the human protein isoforms (Acc. #'s P16871-02, P16871_01) with some identical residues to the query sequence have lower opt and z scores.
  • Try changing the parameters to see how they affect the results.

BLAST Protein Similarity Search

  • Connect to the BLAST site at NCBI. Click the link to the "Standard protein-protein BLAST [blastp]" page. Familiarize yourself with the various features of the site.
  • Choose "nr" for the non-redundant database to search.
  • As with the FASTA exercise above, copy and paste the murine IL-7 receptor (IL-7R) sequence from this text file into the large data entry field. This sequence is already in the FASTA format.
  • Limit the number of hits to a manageable size by changing the Expect value to 1 from the default value of 10. You may also restrict the number of hits returned by decreasing the number of Descriptions and Alignments returned.
  • Use the default Blosum62 scoring matrix as selected at the bottom of the page.
  • Click on the "Search" button to perform the similarity search immediately. On the Blast CGI screen that shows up next, view the results by pressing the "Format results" button. You may also check for any conserved domains between your sequence and the database sequences. You may wish to get the BLAST results by e-mail by providing your e-mail address on the BLAST search screen.
  • Try changing the parameters to see how they affect the results.
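
To see why lowering the Expect threshold from 10 to 1 shrinks the hit list, it can help to compute an E-value by hand. The sketch below assumes the standard relation between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), and ignores the effective-length corrections that BLAST applies internally. The function names and the database size used in the usage note are illustrative assumptions, not values taken from the NCBI site.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value for a hit with the given bit score.

    Assumes E = m * n * 2**(-S'), where m is the query length and n is
    the total database length in residues. Edge-effect corrections that
    real BLAST applies are deliberately left out.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def passes_threshold(bit_score, query_len, db_len, expect=10.0):
    """True if the hit would be reported at the given Expect threshold."""
    return expect_value(bit_score, query_len, db_len) <= expect
```

For example, with the 459-residue murine IL-7R query and a hypothetical database size, calling passes_threshold(score, 459, db_len, expect=1.0) instead of expect=10.0 shows which borderline hits drop out of the report.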

Smith-Waterman Algorithm

    Connect to the Bioccelerator site at EMBL to use the Smith-Waterman algorithm for sequence similarity searches. You have to figure out how to use the site and how to interpret the results. Try to change the parameters to see how they affect the results. Notes:

  • This search tool is a mathematically rigorous dynamic programming algorithm. For each pairwise comparison between the query and a database sequence, it fills a matrix in which each cell holds a similarity score computed iteratively from its neighboring cells.
  • It is very computationally intensive, so similarity searches may take longer than with FASTA or BLAST.
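
The matrix-filling idea described in the notes above can be sketched in a few lines of Python. This is a minimal illustration and not the Bioccelerator implementation: it assumes a simple match/mismatch score instead of a substitution matrix such as Blosum62, and a linear gap penalty instead of the affine gap costs that production tools typically use. The function name and default scores are arbitrary choices.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Every cell of the (len(a)+1) by (len(b)+1) matrix is visited once, so the cost grows with the product of the two sequence lengths. Repeating this for every sequence in a database is what makes the search slow compared with the heuristic FASTA and BLAST methods.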

Comparison of Fasta, BLAST and Smith-Waterman Search Results

  • Check for database sequences that have been pulled out as common hits by the three search algorithms. How do these sequences common to all searches show up in the graphical alignment figures from BLAST, FASTA and SW?
  • How do the statistical significance scores for the common sequences compare between the three programs?
  • Compare the interfaces in terms of accessibility, parameters, documentation and visualization capabilities.
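
One possible way to organize the comparison is to record, for each program, the accession numbers of its hits together with the reported E-values, and then keep only the accessions found by all three. The sketch below assumes you have copied those values by hand into dictionaries. The function name and the data layout are illustrative assumptions, not an output format of any of the three sites.

```python
def common_hits(results):
    """Collect hits reported by every tool.

    results maps a tool name to a dict of accession -> E-value.
    Returns a dict of accession -> {tool: E-value}, restricted to
    accessions that appear in every tool's hit list.
    """
    if not results:
        return {}
    shared = set.intersection(*(set(hits) for hits in results.values()))
    return {
        acc: {tool: hits[acc] for tool, hits in results.items()}
        for acc in sorted(shared)
    }
```

Calling common_hits({"FASTA": fasta_hits, "BLAST": blast_hits, "SW": sw_hits}) with your own three dictionaries gives one row per shared accession, which makes it easy to compare the significance scores side by side in the report.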

Submit by June 3rd

  • Send your report by e-mail to Stefano Lonardi or drop it off at his office (SURGE 320).