Spell Checking
Summary
Your assignment is to implement an primitive spelling-correction tool.
Details
This assignment will require you to create a spelling-correction tool
based on two notions: edit-distance and letter-frequency equivalence.
Edit Distance
In this assignment, we will define edit distance as the number of
point edits (an insertion, deletion, mutation of
a single character, or the transposition (swapping) of two
adjacent characters.) Since there are 4 operations possible at each
location, the time required to naively calculate the distance between
two strings of length n is O(4n). However,
there exists an algorithm that will calculate this distance in
O(n2). You will need to implement this dynamic
programming algorithm. There are descriptions available online if you
are unable to discover the algorithm yourself, but few (if any) of
them include the transposition operation: you'll have to extend the
algorithm yourself to handle that.
Letter Frequency Equivalence
As a tie-breaker for Edit Distance, we will use the notion of letter
frequency equivalence. The letter frequency equivalence of two words
s and t is the number of characters that appear the same
non-zero number of times in both words.
Some examples can make this concrete:
- "cat" and "the" have an LFE of 1: "t" appears once in both, no
other characters are shared.
- "cat" and "ate" have an LFE of 2: "t" and "a" both appear
once.
- "apple" and "pillar" have an LFE of 1: "a" appears 1 time in
both. ("p" appears twice in "apple" and once in "pillar", and "l"
appears twice in "pillar" and once in "apple", so neither
counts.)
This can (and should) be calculated in O(n) time.
The Algorithm
Read text from standard input until EOF. For each whitespace
separated token, strip all non alphabetic characters and then check
against the standard UNIX dictionary /usr/share/dict/words. If
the word appears in the dictionary, print it followed by a single
space. If it does not, check the dictionary for the set of words with
the smallest Edit Distance from the "misspelled" word in question.
Next, find the word or words in that set that have the largest LFE
with the word in question. If more than one word matches these
criteria equally, use the one occurring alphabetically first. Print
out the best match, followed by a single space.
Note that this is a rough approximation to a spell-checking algorithm,
but is quite slow by comparison, and not nearly as accurate. This
should give you some notion of what goes into a spell checker, and
give you some practice writing dynamic programming algorithms in real
applications, but this is not meant to be top quality
spell-checking.
Test files for this assignment can be found here:
as3test1.in as3test1.out
as3test2.in as3test2.out
Note that these tests are set to work on hill.cs.ucr.edu. Any
machine that has a different set of words in /usr/share/dict/words will
likely produce different output.
Your program will be graded automatically for producing identical
outputs on the same inputs. It is strongly recommended that you use
diff like in lab to verify your program's correctness.