Hash Code - Problems
By Tomasz Kuczma
Hash code is the crucial thing in hash-based algorithms like those used in hash maps and all problems come from that simple fact. Its efficiency is as important as the efficiency of the hashing algorithm itself. Let’s talk about those problems and how to solve them.
Why hash code can cause a problem?
The main problems are:
- Implementation determines the collision probability. This is an important factor and people tend to forget or underestimate it.
hashCode()
should be implemented together withequals(Object)
method as they are used jointly. The perfect implementation of those methods should correspond to each other e.g. use the same fields/methods and the same precedence.- You can easily forget to modify those two methods when you add a new field or modify it partially. This can sometimes break the correctness of the program. Of course, beside the case when you don’t want to use the new fields for objects comparison and hashing.
Simple hash collision examples
Here is my list of simple hash collisions (based on OpenJDK 11):
- Hash code of
null
is 0 so it collides with integer and float 0:
assert 0 == Objects.hashCode(null);
assert 0 == Integer.hashCode(0);
assert 0 == Float.hashCode(0.0f);
Your object can return hash 0 and then it collides with null
too.
- It also happens in plain arrays and lists:
Integer[] ints = {null, 7};
Integer[] ints2 = {0, 7};
assert Arrays.hashCode(ints) == Arrays.hashCode(ints2);
- Long uses xor of high-order bits and low-order bits as hash code which has interesting effect:
assert 0 == Long.hashCode(7L<<32 | 7);
long x = (long) anyInt();
assert 0 == Long.hashCode(x<<32 | x);
long y = (long) anyInt();
assert Long.hashCode(x<<32 | y) == Long.hashCode(y<<32 | x);
- Array hash codes can collide easily when there is an element with negative hash:
Integer[] ints = {-31, 7};
Integer[] ints2 = {-31, 0, 0, 0, 7};
assert Arrays.hashCode(ints) == Arrays.hashCode(ints2);
int x = anyInt();
Integer[] ints3 = {-31, x, 0, -31*31*x, 7};
assert Arrays.hashCode(ints) == Arrays.hashCode(ints3);
If you don’t get why -31 “hacks” hash codes checkout previous article about Java collections .
- Empty
HashMap
andHashSet
have hashes equal to 0.
HashSet<Integer> set = new HashSet<>();
assert 0 == set.hashCode();
set.add(0);
assert 0 == set.hashCode();
- Moreover
HashMap
uses xor key hash code and value hash code hence:
HashMap<Integer, Integer> map = new HashMap<>();
assert 0 == map.hashCode();
map.put(0, 0);
assert 0 == map.hashCode();
map.put(1, 1);
assert 0 == map.hashCode();
int x = anyInt();
map.put(x, x);
assert 0 == map.hashCode();
And even more tricky:
HashMap<Integer, Integer> map = new HashMap<>();
map.put(1, -2);
map.put(-2, 1);
assert 0 == set.hashCode();
xor is fun. Isn’t it? :) check out my article Unique Integer for more fun with low-level algorithms optimization.
All of those examples are only the basic ones and can be easily combined together to generate collision.
How to do not forget to modify hashCode()
and equals(Object)
methods?
That’s actually a tricky question. From the one side, we could just add a rule to our code analyzer like SonarQube to detect that.
From the other side sometimes we intentionally want to omit a new field in the equals
method.
Looks like the best advice here is to be aware of it and have good code reviewers in our teams.
One thing that could help (and is one of my best tools in Java) is Lombok.
You could read how Lombok saved my ass
in one of the previous articles.
The true power of @Data
and @Value
annotations comes from generating equals
and hashCode
(besides hiding plain Java boilerplate).
You can also use Kotlin’s data classes or Java’s records (preview feature in JDK 14).
In the next episode
It is 3rd episode but definitely not the last one. We barely touch the tip of the iceberg in the hashing world but it is an important introduction to understand the pros, cons and complexity of the hashing algorithm.
Software engineer with a passion. Interested in computer networks and large-scale distributed computing. He loves to optimize and simplify software on various levels of abstraction starting from memory ordering through non-blocking algorithms up to system design and end-user experience. Geek. Linux user.