Spell Checking

Summary

Your assignment is to implement an primitive spelling-correction tool.

Details

This assignment will require you to create a spelling-correction tool based on two notions: edit-distance and letter-frequency equivalence.

Edit Distance

In this assignment, we will define edit distance as the number of point edits (an insertion, deletion, mutation of a single character, or the transposition (swapping) of two adjacent characters.) Since there are 4 operations possible at each location, the time required to naively calculate the distance between two strings of length n is O(4n). However, there exists an algorithm that will calculate this distance in O(n2). You will need to implement this dynamic programming algorithm. There are descriptions available online if you are unable to discover the algorithm yourself, but few (if any) of them include the transposition operation: you'll have to extend the algorithm yourself to handle that.

Letter Frequency Equivalence

As a tie-breaker for Edit Distance, we will use the notion of letter frequency equivalence. The letter frequency equivalence of two words s and t is the number of characters that appear the same non-zero number of times in both words.

Some examples can make this concrete: This can (and should) be calculated in O(n) time.

The Algorithm

Read text from standard input until EOF. For each whitespace separated token, strip all non alphabetic characters and then check against the standard UNIX dictionary /usr/share/dict/words. If the word appears in the dictionary, print it followed by a single space. If it does not, check the dictionary for the set of words with the smallest Edit Distance from the "misspelled" word in question. Next, find the word or words in that set that have the largest LFE with the word in question. If more than one word matches these criteria equally, use the one occurring alphabetically first. Print out the best match, followed by a single space.

Note that this is a rough approximation to a spell-checking algorithm, but is quite slow by comparison, and not nearly as accurate. This should give you some notion of what goes into a spell checker, and give you some practice writing dynamic programming algorithms in real applications, but this is not meant to be top quality spell-checking.

Test files for this assignment can be found here:
as3test1.in as3test1.out
as3test2.in as3test2.out

Note that these tests are set to work on hill.cs.ucr.edu. Any machine that has a different set of words in /usr/share/dict/words will likely produce different output.

Your program will be graded automatically for producing identical outputs on the same inputs. It is strongly recommended that you use diff like in lab to verify your program's correctness.