Hash Code - Java's collections
By Tomasz Kuczma
Welcome back to Hash Code miniserie where you can read how the hash codes (non-cryptographic hashing algorithms) and hash collections work in different programming languages. This time let’s take a look under the hood of Java’s collections and Strings. How hash codes are generated for them? Let’s check it out.
Arrays
Arrays in Java do not provide its own hashCode()
implementation - it uses Object
default which can cause a lot of error as hash depends on reference (an instance), not on its value.
Java also provides better implemenation via Arrays
helper class.
Let’s take a look into the code:
public static int hashCode(Object[] a) {
if (a == null)
return 0;
int result = 1;
for (Object element : a)
result = 31 * result + (element == null ? 0 : element.hashCode());
return result;
}
OpenJDK 11 implementation.
This algorithm is just an effective implementation of: sum_i=0^n i^(n-i)
as it does much less multiplications - this optimization is called addition-chain exponentiation.
For primitives, Arrays
uses the same algorithm but using type-specific hash which has been discussed in the previous article
.
Note that 0 is used as hash code of null
.
There is a good question to ask here: Why do we need this 31? Can’t we just add all hashes?
Indeed, we could just add all elements hashes as value x should have different hash than value y. But there is one tinny detail there.
How would you distinguish those two arrays then: [1,2]
, [2,1]
?
Without multiplication, the hash code for those two arrays would be the same - 3
. With multiplication we get (31*1 + 1)*31 + 2=994
for the first array and (31*1 + 2)*31 + 1=1024
for the second.
We swap 2 numbers and hash differs significantly - this is a very good feature.
This behavior is common for hash functions and it is mostly demonstrated on cryptographic hash functions like SHA-256. This is a completely different class of hashing and material for the separate article :)
It also returns different values for arrays of different length even for zeros array e.g. hashCode([0,0,0]) != hashCode([0,0])
, thanks to that trick.
Why 31?
The answer is in Joshua Bloch’s Effective Java book:
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
The performance trick might not work on modern CPUs.
How does it work for List, Map and Set?
ArrayList
and LinkedList
(via inheritance from AbstractList
) use the same algorithm as Arrays.hashCode
for the same reason.
HashSet
(AbstractList
to be exact) class looks interesting as it simplifies algorithm to the sum of hash of each element - without n
additional multiplications (*31
) which save CPU time.
This makes a lot of sense since this data structure stores only unique elements and the order of those elements doesn’t matter (or stays the same) as iteration order for HashMap/HashSet
isn’t guaranteed by JVM.
It works the same way for TreeMap
and EnumMap
.
So how does it work for maps then?
Map
(via AbstarctMap
) simply add hashes of all entries (like a set) where entry hash is xor of key hash and value hash.
Suspicious thing is that empty arrays and lists have hash codes equal 0 where maps and sets have hash codes equal 1.
Another interesting fact is that ConcurrentHashMap
implements proper hashCode
method but standard queues like PriorityQueue
and concurrent queues do not.
So, I can create HashSet
of ConcurrentHashMap
but cannot of PriorityQueue
.
Why? I don’t know :(
Strings
String
uses the same algorithm as List
no matter if it uses compact string (Latin1 encoding - 1 byte per character; available since Java 11) or not (UTF-16 encoding - 2 bytes per character).
The best thing is that String
’s hash code is cached and calculated lazy during the first call. This optimization is possible as the entire string is immutable (including string elements).
This trick speeds up subsequent hash collections lookups very much while using the same string.
Hash code is also copied when string is cloned.
It’s not the end
In the next article, I will write about the biggest problems with the hash code.
Software engineer with a passion. Interested in computer networks and large-scale distributed computing. He loves to optimize and simplify software on various levels of abstraction starting from memory ordering through non-blocking algorithms up to system design and end-user experience. Geek. Linux user.