What is a hash table? When to use? Why use it?
Given a (key, value) pair, hash table is data structure which converts the key to an index and then store the value somewhere using that index.
Generally, we can think of it a as a mapping from key to value.
Hash function generates index of buckets or slots using key.
Ideally, it maps key to a unique bucket. The best situation is it provides a uniform distribution of hash values.
Operations cost:
Constant, amortized O(1)
worst case O(n) for search & deletion (add flag to fix it)
Size:
power of 2
prime number, good for even poorly designed hash function
Resizing (resize + insert again O(n) in average (n/2)
Load factor = (# of entries) / (# of buckets)
keep size within a good range (not too many collision, not too large wasted memory)
Open Addressing (close hashing) is method of collision resolution in hash tables. It’s resolved by probing. Probing includes linear probing, quadratic probing and double hashing
Linear probing:
Search: Find next available slot.
Insert:
Delete (Trouble):
Iterate through the following slots until find an empty slot or move one back to this slot (recursively)
Or use flag
Problems happen when there’s primary clustering. It means two records are mapped to the same index which causes one of them to move. And afterwards collisions increases during inserting more key value pair into hash table. A tendency that a collision will cause more nearby collisions.
With poorly designed hash function, i.e., hash function that would not make inputs uniformly distributed, linear probing could be slower than quadratic probing and double hashing.
Someone implement hash table and it is slow, why?
poor hash function
bad open addressing strategy. Waral
Use hash table to store data, but there is much more data than the machine's RAM, how to deal with that? add one more machine, rehash and reconstruct the hash table