Uniq Integer

By Tomasz Kuczma

November 22, 2019

From time to time I like to solve some algorithmic tasks on Hackerrank. First of all, it is good to change the domain after constantly working on different kinds of challenges in professional life like solving business problems, fixing bugs and scaling the system. Secondly, it always reminds me of a very good period in my life - study times. Unfortunately, having an outstanding knowledge about algorithms is mostly useless for most developers. Many companies having department in Poland, where I worked for several years, totally skip this part during the interview process. As far as I know, only huge western companies with a big pool of tech talents like Amazon or Google use to ask interview questions mostly related to algorithm and general system designs than to specific technology but this is a topic for a separate article. In this article, I would like to talk about the challenge found a long time ago on Hackerrank and few cool extensions that jump to my head after I found the solution.

The generic task

There is an array of integers of size MN+L.
There are N+1 unique integers in this array.
One of them repeats L times (M!=L) and other N integers repeat M times.

Question: Find the integer which repeats L times.
Input: 
	M,N,L - integers
	array of integers of size MN+L
Example:
	Input: M=3, N=4, L=2,
	[1, 2, 3, 5, 1, 4, 3, 1, 3, 2, 2, 4, 5, 5]
Output: 4
Explanation:
	Number 4 is the only number which repeats 2 times (L times). Other numbers (1,2,3,5) repeats 3 times (M times).
Requirements:
	Can you do this in O(n) time complexity and O(1) space complexity?

As you can see, this assignment is not so tricky until you want to fulfill all the requirements.

Obvious solutions (which do not fulfill all requirements)

If you just want O(n) time complexity then you can use hashmap to count all the value and then find the one (by iterating through the map) which has different counter value. But this gives you O(n) space complexity. In Python, it looks like this:

def find(integers, l):
	from collections import defaultdict
	counters = defaultdict(int)
	for v in integers:
		counters[v] += 1
		#or one liner: counters = collections.Counter(integers)

	for k,v in counters.items():
		if v == l:
			return k
	# Assume there is always a solution (Python returns None anyway)

If you just want O(1) space complexity then you can use sorting. Then iterate through an array and check M-th and L-th succeeding elements. But this gives you O(n log(n)) time complexity. Of course, you can use bucket sort but it has to allocate extra space for buckets and it is still not O(n) time complexity.

def find(integers, m, l):
	integers.sort()
	print(integers)
	for i in range(0, len(integers) - max(m, l), m):
		if integers[i] == integers[i + l - 1] and (l > m or integers[i] != integers[i + m - 1]):
			return integers[i]
	# check last element
	if integers[-l] == integers[-1] and (l>m or integers[-m] != integers[-1]):
		return integers[-1]

So it looks like a typical CPU-RAM tradeoff . Fortunately, there is a better solution!

Let’s starts with a simpler version of this task

There is an array of integers of size 2N+1.
There are N+1 unique integers in this array.
One of them is unique and other N integers repeat 2 times.

Do you get a solution now? To be honest it took me some time to realize that the key here is XOR. It is a very powerful function because of its properties: commutativity and associativity. Thanks to that the solution can be simplified to:

def find(integers):
	unique = 0 # initiate with identity element for XOR operation
	for v in integers:
		unique ^= v
	return unique

def func_find(integers): # functional style 
	from operator import xor
	from functools import reduce
	return reduce(xor, integers, 0)

But wait. It doesn’t work when integers repeat more than twice (an odd number of times, to be exact). So this is impossible to find unique integer in array of NM+1 (N integers repeat M times). Can we do something about that? Yes, we can ;)

Let’s generalize the algorithm

I have to admit that the solution came to my mind immediately because of the alternative name of exclusive or: addition modulo 2. “Suma modulo 2” term is mostly used by a very linguistic correct Polish lecturers hence it was very easy ;). The alternative name says that result bit is a sum modulo 2 of the bit on the equivalent position in arguments. What if we could sum bits modulo N? That’s it! The generic solution! It can be also extended to the original problem:

from sys import int_info

class AdderModulo:
	def __init__(self, mod, initial_value=0):
		self.mod = mod
		# Simplification
		self.bit_position_values = [0 for i in range(int_info.bits_per_digit)]
		self.add(initial_value)

	def add(self, value):
		for i in range(int_info.bits_per_digit): # for every bit in value integer 
			current_bit = value & 1
			self.bit_position_values[i] += current_bit
			self.bit_position_values[i] %= self.mod
			value >>= 1
		return self  

	def int_value(self):
		value = 0
		for value_at_position in reversed(self.bit_position_values):
			value <<= 1
			value |= (value_at_position != 0)
		return value


def find(integers, m):
	from functools import reduce
	return reduce(AdderModulo.add, integers, AdderModulo(m)).int_value()

In Python, integers can have an “infinite” number of bits hence this solution is not universal in the Python world. I am using it for simplicity. In C++/Java like languages, we should use fixed-size array (of size 8*sizeof(int)) for performance reason. Note that L%M != 0.

It’s generic, efficient and scales!

The most interesting part is that the solution scales very well horizontally and vertically. This is very uncommon for Hackerrank’s assignment. The implementation is also quite good:

from the CPU perspective as there is no if statement which can break pipeline computing (compiler should unroll this “bit” loop).
from memory accesses point of view as sequent memory cells are accessed. This works well with the cache line mechanism. Compiler should also add optimal prefetching instructions which speeds it up even more. Moreover, we don’t need to load the entire integer list into memory.

The interesting thing is that it can be also extended to any data type - not only integers:

This type just has to be serializable and deserializable to some binary representation.
Or just hashable, so we can apply AdderModulo on hashes to find hash of the result. However, we have to find original value then by scanning input array and solve collision problem eg. by using Python’s Counter on elements matching result hash. This requires additional memory but a little bit (based on collision probability). Distributed processing can be also organized easily with map-reduce pattern. The proofs are left as an exercise to the reader ;)

And here is a link for the original task which is not so “tricky” with a current description as it was in the past.

I don’t know why such thoughts came to my mind when I solve algorithmic tasks even such trivial like using XOR with array. But this is why I love solving such tasks as it reminds me old good study times when there was always a time for divagation about algorithms their correctness, scalability, performance and possibility of extension. Do you sometimes feel that too?

Article by Tomasz Kuczma

Software engineer with a passion. Interested in computer networks and large-scale distributed computing. He loves to optimize and simplify software on various levels of abstraction starting from memory ordering through non-blocking algorithms up to system design and end-user experience. Geek. Linux user.

The views I express are my alone and they do not necessarily express the views of my employer or ex-employers. They are not investment advice nor based on any non-public information of any kind. Poglądy, które wyrażam, są tylko moje i niekoniecznie wyrażają opinie mojego pracodawcy lub byłych pracodawców. Nie są poradami inwestycyjnymi ani nie opierają się na jakichkolwiek niepublicznych informacjach.