Hash Code - Java's collections

By Tomasz Kuczma

February 7, 2020

Welcome back to Hash Code miniserie where you can read how the hash codes (non-cryptographic hashing algorithms) and hash collections work in different programming languages. This time let’s take a look under the hood of Java’s collections and Strings. How hash codes are generated for them? Let’s check it out.

Arrays

Arrays in Java do not provide its own hashCode() implementation - it uses Object default which can cause a lot of error as hash depends on reference (an instance), not on its value. Java also provides better implemenation via Arrays helper class. Let’s take a look into the code:

public static int hashCode(Object[] a) {
	if (a == null)
		return 0;

	int result = 1;

	for (Object element : a)
		result = 31 * result + (element == null ? 0 : element.hashCode());

	return result;
}

OpenJDK 11 implementation.

This algorithm is just an effective implementation of: sum_i=0^n i^(n-i) as it does much less multiplications - this optimization is called addition-chain exponentiation. For primitives, Arrays uses the same algorithm but using type-specific hash which has been discussed in the previous article . Note that 0 is used as hash code of null.

There is a good question to ask here: Why do we need this 31? Can’t we just add all hashes? Indeed, we could just add all elements hashes as value x should have different hash than value y. But there is one tinny detail there. How would you distinguish those two arrays then: [1,2], [2,1]? Without multiplication, the hash code for those two arrays would be the same - 3. With multiplication we get (31*1 + 1)*31 + 2=994 for the first array and (31*1 + 2)*31 + 1=1024 for the second. We swap 2 numbers and hash differs significantly - this is a very good feature.

This behavior is common for hash functions and it is mostly demonstrated on cryptographic hash functions like SHA-256. This is a completely different class of hashing and material for the separate article :)

It also returns different values for arrays of different length even for zeros array e.g. hashCode([0,0,0]) != hashCode([0,0]), thanks to that trick.

Why 31?

The answer is in Joshua Bloch’s Effective Java book:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

The performance trick might not work on modern CPUs.

How does it work for List, Map and Set?

ArrayList and LinkedList (via inheritance from AbstractList) use the same algorithm as Arrays.hashCode for the same reason. HashSet (AbstractList to be exact) class looks interesting as it simplifies algorithm to the sum of hash of each element - without n additional multiplications (*31) which save CPU time. This makes a lot of sense since this data structure stores only unique elements and the order of those elements doesn’t matter (or stays the same) as iteration order for HashMap/HashSet isn’t guaranteed by JVM. It works the same way for TreeMap and EnumMap.

So how does it work for maps then? Map (via AbstarctMap) simply add hashes of all entries (like a set) where entry hash is xor of key hash and value hash.

Suspicious thing is that empty arrays and lists have hash codes equal 0 where maps and sets have hash codes equal 1. Another interesting fact is that ConcurrentHashMap implements proper hashCode method but standard queues like PriorityQueue and concurrent queues do not. So, I can create HashSet of ConcurrentHashMap but cannot of PriorityQueue.

Why? I don’t know :(

Strings

String uses the same algorithm as List no matter if it uses compact string (Latin1 encoding - 1 byte per character; available since Java 11) or not (UTF-16 encoding - 2 bytes per character). The best thing is that String’s hash code is cached and calculated lazy during the first call. This optimization is possible as the entire string is immutable (including string elements). This trick speeds up subsequent hash collections lookups very much while using the same string. Hash code is also copied when string is cloned.

It’s not the end

In the next article, I will write about the biggest problems with the hash code.

Article by Tomasz Kuczma

Software engineer with a passion. Interested in computer networks and large-scale distributed computing. He loves to optimize and simplify software on various levels of abstraction starting from memory ordering through non-blocking algorithms up to system design and end-user experience. Geek. Linux user.

The views I express are my alone and they do not necessarily express the views of my employer or ex-employers. They are not investment advice nor based on any non-public information of any kind. Poglądy, które wyrażam, są tylko moje i niekoniecznie wyrażają opinie mojego pracodawcy lub byłych pracodawców. Nie są poradami inwestycyjnymi ani nie opierają się na jakichkolwiek niepublicznych informacjach.