Memory-efficient HashSet implementation for Java

In the previous post I introduced you to hash tables and how they work. Now I will show how I implemented one some time ago and reduced memory use more than 3.5-fold. I should add, though, that it was a special case (it works only on hexadecimal strings) and most of the time it's best not to reinvent the wheel.

So what had to be done was to save hexadecimal strings as keys in a hash table and then check whether each new incoming string is already there. Basically a check for duplicates. Let's see how much memory is used if we solve it in just a few lines:

import java.util.Set;
import java.util.UUID;

public class Main {

	private static final int SET_SIZE = 1_000_000;

	public static void main(String[] args) {
		Set&lt;String&gt; javaSet = new java.util.HashSet&lt;&gt;(SET_SIZE);
		for (int i = 0; i &lt; SET_SIZE; i++) {
			String randomHexString = UUID.randomUUID().toString().replaceAll("-", "");
			javaSet.add(randomHexString);
		}

		System.out.println(MemoryMeasurer.measureBytes(javaSet));
	}

}

The result was 144388704 bytes, which is about 144 MB for one million strings. That is not too bad and would probably be the end of the story in most cases. But what if you want to keep a lot more entries and need to optimize for memory consumption?

Compressing strings

The first thing to notice is that the inputs are hexadecimal strings. That means one character can be encoded in 4 bits (e.g. 'F' becomes 1111), so two characters fit into one byte! Instead of a 32-character string we can use a 16-element byte array! On top of saving memory per character, we also avoid String objects, which have a fairly large overhead (around 40 bytes versus 16 bytes for a bare array). This is my converter between a hexadecimal string and an array. Here is a nice Stack Overflow question about the memory Java objects consume, and here is one with details on strings.
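To illustrate the packing, here is a minimal sketch of such a converter. The class and method names are my own invention for this example, not the author's actual code:

```java
// Illustrative sketch: pack two hex characters into one byte and back.
public final class HexCodec {

	// Packs an even-length hex string into a byte array half its size.
	public static byte[] toBytes(String hex) {
		byte[] result = new byte[hex.length() / 2];
		for (int i = 0; i < result.length; i++) {
			int high = Character.digit(hex.charAt(2 * i), 16);     // upper 4 bits
			int low = Character.digit(hex.charAt(2 * i + 1), 16);  // lower 4 bits
			result[i] = (byte) ((high << 4) | low);
		}
		return result;
	}

	// Unpacks a byte array back into a lowercase hex string.
	public static String toHex(byte[] bytes) {
		StringBuilder sb = new StringBuilder(bytes.length * 2);
		for (byte b : bytes) {
			sb.append(Character.forDigit((b >> 4) & 0xF, 16));
			sb.append(Character.forDigit(b & 0xF, 16));
		}
		return sb.toString();
	}
}
```

A 32-character UUID hex string run through `toBytes` becomes a 16-element byte array, exactly the halving described above.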

Once we decide to use byte[] instead of String, we face a problem: arrays inherit hashCode() and equals() from Object itself, so they compare by identity and will not work correctly as keys in Java's HashMap and HashSet. I still wanted to optimize further, and wrapping each array in another object would add overhead again, so I decided to roll my own HashSet implementation.
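A quick demonstration of the problem (this example is mine, not from the original post): two byte arrays with identical contents are not equal via equals(), so a HashSet fails to detect the duplicate; Arrays.equals and Arrays.hashCode give the value semantics we actually want.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Shows why raw byte[] keys break HashSet: identity-based equals()/hashCode().
public class ArrayKeyDemo {

	public static void main(String[] args) {
		byte[] a = {1, 2, 3};
		byte[] b = {1, 2, 3};

		System.out.println(a.equals(b));          // false: identity comparison
		System.out.println(Arrays.equals(a, b));  // true: value comparison

		Set<byte[]> set = new HashSet<>();
		set.add(a);
		System.out.println(set.contains(b));      // false: duplicate not detected
	}
}
```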

Implementing HashSet

I will not copy all the source code here as that would take too much space. Instead I will explain the main idea and provide links to the source on github.

From my previous article you can see that a chaining hash table is quite simple to implement. All the values in the hash table are put into buckets, so it would basically be an array of linked lists of byte arrays. Something like this:

[Diagram: an array of buckets, each holding a linked list of byte arrays]

And that means each input value is saved as a byte array, which is an object in Java, and each object carries additional memory overhead. So if all the values are of fixed size (which is true in this situation), it is very easy to merge all those byte arrays into one large array and keep them there. The hash table then becomes an array of linked lists of integers, where each integer points to the index in the big array at which the value is stored. To save even more space, a simple array was used for each bucket instead of a linked list. If a bucket gets full, its array is resized to accommodate more data; it is downsized again when elements are deleted and the space is no longer needed.

So it looks something like this (source is here)

[Diagram: buckets of int indices pointing into one large data array]

Implementation notes

One thing to note is that by using one big array for the data we allocate a lot of memory right from the start. That could be solved by starting with a smaller array and increasing its size gradually; this implementation, however, does not do that. Another special case of this implementation is its bounded size: once the data array is full, the oldest element is overwritten by the new one. Again, it depends on the requirements. It could work like that (good for checking for duplicates over the last n entries, which was my case) or it could grow indefinitely. One more detail is that the hash table size and the data array size are the same; it is trivial to change that if you want a load factor above one. This HashSet also does not implement a remove operation, simply because I did not need one, but that too would just be a standard hash table removal. The data array can be left as is: without a pointer to it in the hash table the value cannot be retrieved and will simply be overwritten later.

Here is the source code for the main logic of the hash table's insert and search operations.
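To make the layout concrete, here is a minimal, self-contained sketch of the structure described above. It is not the author's actual code (the class and field names are invented), and for simplicity it leaves stale bucket entries in place after wrap-around overwrites, which the real implementation may handle differently:

```java
import java.util.Arrays;

// Illustrative sketch: a bounded set of fixed-size byte values stored in
// one flat data array, with buckets holding int indices into that array.
public class FlatByteSet {

	private final int valueSize;      // bytes per value (16 for a 32-char hex string)
	private final int capacity;       // max number of stored values
	private final byte[] data;        // all values packed back to back
	private final int[][] buckets;    // each bucket is a growable array of data indices
	private final int[] bucketSizes;  // number of used slots per bucket
	private int nextSlot = 0;         // next data slot; wraps around to overwrite oldest

	public FlatByteSet(int capacity, int valueSize) {
		this.capacity = capacity;
		this.valueSize = valueSize;
		this.data = new byte[capacity * valueSize];
		this.buckets = new int[capacity][];   // table size == data array size
		this.bucketSizes = new int[capacity];
	}

	private int bucketOf(byte[] value) {
		return (Arrays.hashCode(value) & 0x7FFFFFFF) % capacity;
	}

	public boolean contains(byte[] value) {
		int b = bucketOf(value);
		for (int i = 0; i < bucketSizes[b]; i++) {
			int offset = buckets[b][i] * valueSize;
			// Compare against the value's bytes inside the big array (Java 9+ overload).
			if (Arrays.equals(value, 0, valueSize, data, offset, offset + valueSize)) {
				return true;
			}
		}
		return false;
	}

	public void add(byte[] value) {
		if (contains(value)) {
			return; // duplicate
		}
		int slot = nextSlot;
		nextSlot = (nextSlot + 1) % capacity; // overwrite oldest once full
		System.arraycopy(value, 0, data, slot * valueSize, valueSize);
		int b = bucketOf(value);
		if (buckets[b] == null) {
			buckets[b] = new int[2];
		} else if (bucketSizes[b] == buckets[b].length) {
			// Bucket full: resize its index array to accommodate more entries.
			buckets[b] = Arrays.copyOf(buckets[b], buckets[b].length * 2);
		}
		buckets[b][bucketSizes[b]++] = slot;
	}
}
```

The key saving is that each stored value costs only its raw bytes plus an int index, instead of a separate byte[] object with its own header.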

Now let's compare our new HashSet, combined with the string-conversion trick, against the initial implementation.


import java.util.Set;
import java.util.UUID;

public class Main {

	private static final int SET_SIZE = 1_000_000;

	public static void main(String[] args) {
		Set&lt;String&gt; javaSet = new java.util.HashSet&lt;&gt;(SET_SIZE);
		Set&lt;String&gt; memoryEfficientSet = new org.intelligentjava.HashSet(SET_SIZE);
		String[] simpleArray = new String[SET_SIZE];
		for (int i = 0; i &lt; SET_SIZE; i++) {
			String randomHexString = UUID.randomUUID().toString().replaceAll("-", "");
			javaSet.add(randomHexString);
			memoryEfficientSet.add(randomHexString);
			simpleArray[i] = randomHexString;
		}

		System.out.println(MemoryMeasurer.measureBytes(javaSet));
		System.out.println(MemoryMeasurer.measureBytes(simpleArray));
		System.out.println(MemoryMeasurer.measureBytes(memoryEfficientSet));
	}

}
	

My results: 144388704 bytes for Java's HashSet, 108000016 for a simple String array, and 40383488 for my HashSet implementation. As you can see, we save more than 3.5 times the memory. And it could probably still be improved: one option I see is open addressing, but let's leave that for next time. As always, you can find my code on GitHub and try it for yourself! Looking forward to your feedback. Happy coding and see you later!


4 thoughts on "Memory-efficient HashSet implementation for Java"

  1. I actually once implemented a HashMap using double hashing because we had issues with all the Entry objects in a large map. Worked like a charm.

    1. Yeah, it was a similar problem here: all those objects take a lot of space in Java. Actually, I think this implementation could save even more space by using open addressing with double hashing. The hash table might have to be a bit bigger, but right now each bucket is an array, which is a Java object, and that means memory overhead. Getting rid of the buckets would probably save memory in the end.
