The array has n items all unique except only one item. We need to come up with an algorithm that tells us which item is duplicated in maximum time of

O(n) where n is the length of the array.

To do that, let's first come up with some (less efficient) working algorithms. Here's the first one:

We loop over the array starting from the item at index 0, then we traverse the remaining part of the array (1 : n-1) searching for the first item

The C# code for this algorithm looks like the following:

static int FirstDuplicate(int[] arr) { for(int i = 0; i < arr.Length - 1; i++) { for(int j = i + 1; j < arr.Length; j++) { if (arr[i] == arr[j]) return arr[i]; } } return -1; }As you can see, this algorithm is pretty bad. Foreach item in the array an inner loop is initiated to linerally look for that specific element in the

rest of the array. If you do the math, you shall find that this algorithm runs in O(n2) order of growth.

One way to improve this is to use an extra HashSet to store the items, and then look up each item in the HashSet. This is considered improvement as the

lookup inside the HashSet is really fast.

Here's the code in C#:

static int FirstDuplicateWithHashSet(int[] arr) { HashSetThis is pretty good, but still not O(n).hashHset = new HashSet for(int i = 0; i < arr.Length; i++) { if (hashHset.Contains(arr[i])) return arr[i]; hashHset.Add(arr[i]); } return 0; }();

The next algorithm is quite tricky. The idea simply is to create a second array and insert each elemnt in the first array at an index equivalent to its

value in the second array.

For example if we have a list of 5 items [2, 4, 5, 2, 6], where the item 2 is duplicated at the 0 index and third index, and the maximum value in this

array is 6. Now to find the duplicates in this list we create a second list with length 6 ( the length of the second array equals the maximum value in

the first array). After creating this second list we loop over the first array take the first item (2 in our case) and insert it at the index 2 in

the second array, then we take the second item (which is 4) and insert it at the 4th index in the second array, and so on. Each time we try to insert

an item in the second array we check if it has a value first, if it is, then this item is duplicated.

here's how the code would look like:

static int FirstDuplicate(int[] arr, int maxVal) { int[] temp = new int[maxVal+1]; for(int i =0; i < arr.Length; i++) { if (temp[arr[i]] == arr[i]) return arr[i]; temp[arr[i]] = arr[i]; } return 0; }Note: The method expects the maximum value as an input, however, if you don't get the maximum value you can create the temp array with a size

equal to int.MaxValue (which is not a good idea)

This algorithm is probably not realistic but it doesn run in O(n) time.

One more trick to add here, if you know the range of the items in the array (e.g. from 1 to 10) you can get the sum of the numbers of the array

then subtract from it the sum of the numbers from 1 to 10, the remainder is the duplicated value.

That was a quick tip that I thought is cool and wanted to share with ya! So what do you think dear fellows? Do you know of any possibly better

algorithms? Do you suggest any optimization to the current ones?

## 5 comments:

The algorithm you describe is called "bucket sort".

Also, why would the second algorithm not be O(n)? Assuming your HashSet has amortized O(1) operations, it should be.

Also, you can read this:

http://aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html

@shezi.de Very interesting link indeed. The hashset algorithm is very fast it's the constant factor of hashing and collision resolving that might make it a little slower than the other one.

@shezi.de

A hash-set would be expected O(1) not AMORTIZED O(1).

@ Galilyou

The hashSet based algorithm is expected O(n) assuming the hashSet provides O(1) expected insertions and membership checking (which it should). Since the big-O notation takes care of the "constant factor of hashing", there is no reason to claim that the algorithm isn't O(n).

@kink hands down. I stand corrected. Thanks for the nice clarification.

Post a Comment