CS141 BB: HashTable

Hash Table functionality

A hash table is a data structure supporting the following kinds of "Dictionary" operations:

set_value(key, value) - set the value associated with the given key
get_value(key) - return the value most recently assigned to the given key
exists(key) - return whether the key has any value assigned to it
remove(key) - remove any assignment of a value to the given key

A hash table may also provide other operations such as iteration over the keys, or returning the set of keys that have values assigned to them.

Under some reasonable assumptions (chiefly that the hash function spreads the keys roughly evenly among the buckets), the hash table supports each of these operations in constant (O(1)) time per operation on average.
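As an illustration of the four operations listed above, here is a minimal sketch that wraps the C++ standard library's hash table, std::unordered_map. The class name Dictionary and the choice of string keys with int values are assumptions made for the example; only the operation names come from the list above.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical wrapper: exposes the four dictionary operations
// on top of std::unordered_map (string keys, int values assumed).
class Dictionary {
public:
    // Assign a value to the key, replacing any earlier assignment.
    void set_value(const std::string& key, int value) { table_[key] = value; }

    // Return the value most recently assigned to the key.
    // Precondition (checked): the key currently has a value.
    int get_value(const std::string& key) const {
        auto it = table_.find(key);
        assert(it != table_.end());
        return it->second;
    }

    // Report whether the key has any value assigned to it.
    bool exists(const std::string& key) const { return table_.count(key) > 0; }

    // Remove any assignment of a value to the key (no-op if absent).
    void remove(const std::string& key) { table_.erase(key); }

private:
    std::unordered_map<std::string, int> table_;
};
```

A caller would write, for example, d.set_value("a", 1) followed by d.get_value("a"), and use d.exists("a") to guard the lookup when the key may be missing.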

Hash Table implementation

A hash function, F, is a function that, given a key, returns an integer. The function should be computed quickly.

A bucket array is an array of "buckets". The implementation determines the number of buckets, N. The idea is that, given a key, you can quickly (in O(1) time) tell which bucket that key will be stored in. The standard way to do this is to compute the hash function value for the key, F(key), and then take the remainder when that value is divided by N. That is, the key is "hashed" into the Jth bucket, where J = F(key) % N. (The "%" operator is "mod" in C++).
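The computation J = F(key) % N can be sketched as follows. The helper name bucket_index is hypothetical, and std::hash stands in for the hash function F; any other hash function could be substituted.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: returns J = F(key) % N, where std::hash plays
// the role of F and num_buckets is N.
std::size_t bucket_index(const std::string& key, std::size_t num_buckets) {
    assert(num_buckets > 0);  // % by zero is undefined
    std::size_t h = std::hash<std::string>{}(key);  // F(key), unsigned
    return h % num_buckets;   // always in the range 0 .. N-1
}
```

One design point: std::hash returns an unsigned type, so the remainder is never negative. If F returned a signed int, the C++ "%" operator could produce a negative result for a negative F(key), which would not be a valid bucket index.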

You also store with each key (in its bucket) the value most recently assigned to the key.

A typical implementation implements each bucket as a linked list (or array) of (key, value) pairs.

To find whether a key exists, or to find the value associated with the key, it's enough to look in the one bucket that the key hashes to (the Jth bucket, where J = F(key) % N). (One does not have to look through all of the buckets.)
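The bucket-of-pairs layout and the single-bucket lookup described above might look like the sketch below. It assumes a fixed number of buckets, string keys, int values, and std::hash as the hash function; the class name ChainedHashTable and the pointer-returning find are illustrative choices, not a prescribed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Sketch of a hash table whose buckets are linked lists of (key, value) pairs.
class ChainedHashTable {
public:
    explicit ChainedHashTable(std::size_t num_buckets) : buckets_(num_buckets) {
        assert(num_buckets > 0);
    }

    // Overwrite the value if the key is already in its bucket, else append.
    void set_value(const std::string& key, int value) {
        auto& bucket = buckets_[index_of(key)];
        for (auto& entry : bucket) {
            if (entry.first == key) { entry.second = value; return; }
        }
        bucket.emplace_back(key, value);
    }

    // Search only the one bucket the key hashes to.
    // Returns a pointer to the stored value, or nullptr if the key has none.
    const int* find(const std::string& key) const {
        const auto& bucket = buckets_[index_of(key)];
        for (const auto& entry : bucket) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

    bool exists(const std::string& key) const { return find(key) != nullptr; }

    // Erase the key's pair from its bucket, if present.
    void remove(const std::string& key) {
        auto& bucket = buckets_[index_of(key)];
        bucket.remove_if([&key](const std::pair<std::string, int>& e) {
            return e.first == key;
        });
    }

private:
    // J = F(key) % N
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, int>>> buckets_;
};
```

Every operation touches exactly one bucket, so its cost is proportional to the length of that bucket's list rather than to the total number of keys.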

An important part of implementing the hash table is to make sure that the table size N is approximately proportional to the total number of keys in the table. This is because a typical bucket will contain around

    (total # keys) / N

keys, and we want a typical bucket to contain O(1) keys (even when the total # keys in the table is large). We want this because once we've figured out which bucket a key belongs to, we have to search all the keys in that bucket to find the one we're looking for. This takes time proportional to the number of keys in the bucket.

Rehashing

Some implementations will require the size of the hash table to be specified in advance, and not to change throughout the entire computation. A nicer implementation will automatically adjust the size of the table to stay roughly proportional to the number of keys. This requires periodically changing the size of the table (N).

Note that when the size of the table changes, the proper bucket for each key changes. (Because if the new size of the table is N', it is unlikely that F(key) % N = F(key) % N'.) This means that when the table is resized, all of the keys need to be rehashed (taken out of the old table, and reinserted in the proper buckets of the new table). This takes time linear in the size of the tables and the number of keys.
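A rehash along these lines can be sketched as a function that builds a new bucket array of size N' and reinserts every pair. The type aliases Entry and Buckets and the function name rehash are hypothetical, and std::hash again stands in for F.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;   // (key, value)
using Buckets = std::vector<std::list<Entry>>;

// Returns a new bucket array of size new_size holding the same pairs,
// each placed in bucket F(key) % new_size.
Buckets rehash(const Buckets& old_buckets, std::size_t new_size) {
    assert(new_size > 0);
    Buckets fresh(new_size);
    for (const auto& bucket : old_buckets) {      // visits every old bucket
        for (const auto& entry : bucket) {        // and every key in it
            std::size_t j = std::hash<std::string>{}(entry.first) % new_size;
            fresh[j].push_back(entry);
        }
    }
    return fresh;
}
```

The two nested loops make the linear cost visible: the outer loop runs once per old bucket, and the inner loop runs once per key in total.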

But with a careful choice of resizing rules (as in the GrowableArray), the total work for M hash table operations will still be O(M), even if some resizes are required.
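To see why the total stays linear, the sketch below counts how many keys get rehashed over a run of insertions. It assumes one particular resizing rule, which the text above does not spell out: start with one bucket and double N whenever an insertion would make the number of keys exceed N. The function name total_rehash_moves is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Counts the keys moved by rehashing during num_inserts insertions,
// under the assumed rule "double N when the table is full".
std::size_t total_rehash_moves(std::size_t num_inserts) {
    assert(num_inserts > 0);
    std::size_t num_buckets = 1;  // N
    std::size_t num_keys = 0;
    std::size_t moves = 0;
    for (std::size_t i = 0; i < num_inserts; ++i) {
        if (num_keys == num_buckets) {
            moves += num_keys;    // a resize rehashes every existing key
            num_buckets *= 2;
        }
        ++num_keys;
    }
    return moves;
}
```

Under this rule the resizes happen when the table holds 1, 2, 4, 8, ... keys, so the moves form a geometric series whose sum is less than twice the number of insertions. For 1000 insertions the count is 1 + 2 + 4 + ... + 512 = 1023 moves.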

References:

Section 2.5 of Goodrich and Tamassia (read through "Load factors and rehashing")