Introduction to hash tables

Some time ago I wrote a post about binary search trees. I mostly described them as a data structure for more efficient search than naive search in an array. But in fact another data structure that is very very widely used is much better if you need only the basic set operations – add, search, delete. It’s a hash table – one of the most used data structure in programming. I am sure you are already using it a lot with HashMaps or HashSets from java collections. Binary search trees can not compete with their O(1) time complexity. However trees have additional advantages – they require less memory and additionally to basic add/search/delete operations they support traversing elements in order, finding closest elements, searching in ranges – things that hash tables are not designed for. It is designed for retrieving elements fast and that’s basically it. So if you don’t need anything else – you should use hash table. In this post I will explain how they work and share my very simple sample implementations.

Hash table idea is very simple. For each element we insert we calculate its unique address, its index in the array and we put it to that “address”. Then once we need to find it, we again calculate its index and just return it with a single operation (something like return array[calculatedIndex]). That’s why it is has O(1) time complexity. The process of calculating element’s unique address is called hashing and the algorithm how it is done is called a hash function.

Hash table illustration
Hash table illustration

Now you probably already see a problem. How does hash function manages to give unique address to every element? Well it doesn’t. It tries to give an unique address but how well it succeeds depends on how good a hash function is. If two different elements get the same hash it’s called collision and the two most popular ways to solve it are chaining and open addressing (explained later). If you already know all the inputs in advance it is possible to create a perfect hash function (when each hash is guaranteed to be unique). Check out tools like gperf or CMPH for that.

Hash function

When you are writing a hash function your main goal is that all the hashes would be distributed uniformly. That will minimize probability of collisions. For example if you know that all the values are strings with different lengths then a simple string length can be a good hash function. However if your values are English words then such a hash function would perform very poorly as majority of words are 4-10 letter words so there would be a lot of collisions. Another example of a simple hash function is for real numbers between 0 and 1. Sufficiently good hash function could be value * hashTableSize (if input values are uniformly distributed). Those are just few simple examples to illustrate how you can think about hash functions. There are of course other hash functions developed by mathematicians and probably you should do your homework before choosing one for your particular use case. Here is one interesting comparison of some well known hash functions.

Solving collisions

As I previously mentioned there are two ways to solve collisions problem. First one is chaining when we simply use a linked list and put all the values with the same hash into it. So basically hash table becomes an array of buckets (linked lists) where the hash points to the bucked where the element can be found. For that reason if hash function simply returns a constant then all the elements would end up in a linked list and worst case search performance from our expected O(1) would degrade to O(n)! That’s why it is so important to choose a good hash function for your data.

Hash table with chaining illustration
Hash table with chaining illustration

Here is my very simplified java implementation of chaining. Hash table is an array of linked lists with no dynamic resizing and it supports only String values but it’s fine for illustrative purposes. Code actually is pretty self explanatory – calculate hash, find bucket, add/remove/find value in the bucket.

public class HashTable {
	
	private static int SIZE = 1000;
	
	private LinkedList<String>[] hashtable = (LinkedList<String>[])new LinkedList[SIZE];
	
	public void add(String value) {
		int hash = hash(value);
		if (hashtable[hash] == null) {
			hashtable[hash] = new LinkedList<>();
		}
		LinkedList<String> bucket = hashtable[hash];
		bucket.add(value);
	}
	
	public boolean contains(String value) {
		int hash = hash(value);
		LinkedList<String> bucket = hashtable[hash];
		if (bucket != null) {
			Iterator<String> it = bucket.iterator();
			while (it.hasNext()) {
				if (it.next().equals(value)) {
					return true;
				}
			}
		}
		// value not found
		return false;
	}

      // exactly the same as contains() just additionally remove value
	public boolean remove(String value) {
		int hash = hash(value);
		LinkedList<String> bucket = hashtable[hash];
		if (bucket != null) {
			Iterator<String> it = bucket.iterator();
			while (it.hasNext()) {
				if (it.next().equals(value)) {
					it.remove();
					return true;
				}
			}
		}
		// value not found
		return false;
	}
	
	// use guave library for hash function
	private int hash(String value) {
		HashFunction hf = Hashing.murmur3_128(); // can choose any hash function
		return Math.abs(hf.newHasher()
	       .putString(value, Charsets.UTF_8)
	       .hash().asInt()) % SIZE;
	}
}

Open addressing

In an open addressing method all the elements occupy hash table array itself. There are no buckets – only a value or null. That means that hash table can fill up and no further insertions are possible. However because it does not need to keep additional pointers it’s more memory efficient so you can allocate a larger array which should result in less collisions and possibly better performance. So how do we solve collision problem after all? When inserting an element we probe hash table until we find a place for it. In other words we calculate hash then check if the slot is empty if not calculate another hash and check that slot and continue until a place for the element is found. To do that we extend hash function with additional parameter – probe index.

Hash table with open addressing
Hash table with open addressing

Here is my implementation in java:

public class OpenAddressingHashTable {

	private static int SIZE = 1000;

	private String[] hashtable = new String[SIZE];

	public void add(String value) {

		int probe = 0;

		do {
			int hash = hash(value, probe);
			if (hashtable[hash] == null) {
				hashtable[hash] = value;
				return;
			}
			probe++;
		} while (probe < SIZE);

		// hash table is full
		throw new RuntimeException("Hash table overflow");
	}

	public boolean contains(String value) {
		int probe = 0;

		do {
			int hash = hash(value, probe);
			if (hashtable[hash] == null) {
				return false;
			} else {
				if (hashtable[hash].equals(value)) {
					return true;
				}
				probe++;
			}

		} while (probe < SIZE);

		return false;
	}

	private int hash(String value, int probe) {
		HashFunction hf = Hashing.murmur3_128(); // can choose any hash function
		int hash = Math.abs(hf.newHasher().putString(value, Charsets.UTF_8).hash().asInt()) % SIZE;

		return (hash + probe) % SIZE;
	}

}

The algorithm is quite simple – run the loop incrementing probe variable and recalculate hash for different probes until an empty slot is found (so for add() operation insert can be performed and for contains() operation we can know that the element is not in the table) or until probe index reaches size of the hash table which then means that the table is full.

This exact hash function extension with probes is called linear probing. Formula for it is:

h(value, probe) = (h(value) + 1) mod size

It is easy to implement but not necessarily the best. It might increase sequences of occupied slots which might decrease the performance. There are other methods like quadratic probing and double hashing which might be a better choice but I skip it here since this is only an introductory article. Lastly I did not implement delete operations as it is a little bit more complicated. In essence it can be implemented the same way just it can’t set null value when removing and instead it should be some special value DELETED. We can not set null values because then search would start working incorrectly and some values might be lost because chain of probes could be broken. I wanted to keep code as clean as possible for illustrative purposes so I decided to skip remove operation. You can easily do it yourself and it could be a good exercise actually.

To summarize you will probably never need to implement a hash table yourself and java collection’s HashMap and HashSet will be enough for majority of use cases. However it’s always good to understand how it works and sometimes in some special rare use cases you might need to develop and improve your own algorithm. In the next article I will show one example of a special case where you can save some memory if you implement hash table yourself.

So after reading this piece you now know and understand the basics about hash functions and importance of its choice, differences between chaining and open addressing, what linear probing is and have some very simple and clear code to start working on after learning more nuances (you can start of course in wikipedia). However in real world it’s better to use the battle tested and optimized library. For example HashMap even starts using a binary tree data structure instead of linked list in a bucket if it the bucket becomes quite large. Anyway I hope you liked this post and as always you can find code on my github. Don’t forget to override your hashCode and equals and happy coding!