Learner

Reputation: 2546

Find duplicate entry in array of integers

As a homework question, the following task was given:

You are given an array with integers between 1 and 1,000,000. One integer is in the array twice. How can you determine which one? Can you think of a way to do it using little extra memory?

My solutions so far:

Solution 1

  1. Create a hash table
  2. Iterate through the array, storing each element in the hash table
  3. As soon as you find an element that is already in the hash table, it is the duplicate

Pros

It runs in O(n) time with only one pass

Cons

It uses O(n) extra memory
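The steps above can be sketched in Python, using a set as the hash table (the function name is illustrative):

```python
def find_duplicate_hash(arr):
    # One pass: remember every value seen; the first repeat is the duplicate.
    seen = set()
    for x in arr:
        if x in seen:
            return x
        seen.add(x)
    return None
```

For example, `find_duplicate_hash([3, 1, 4, 1, 5])` returns `1`.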

Solution 2

  1. Sort the array using merge sort (O(n log n) time)
  2. Scan it again; if you see an element twice, you have found the duplicate.

Pros

It doesn't use extra memory (assuming an in-place sort)

Cons

Running time is O(n log n), greater than O(n)
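Solution 2 as a Python sketch (using the built-in sort in place of an explicit merge sort):

```python
def find_duplicate_sorted(arr):
    # Sort (O(n log n)), then scan adjacent pairs for a repeat.
    s = sorted(arr)
    for prev, cur in zip(s, s[1:]):
        if prev == cur:
            return cur
    return None
```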

Can you guys think of any better solution?

Upvotes: 42

Views: 11776

Answers (9)

hughdbrown

Reputation: 49043

This Python code is a modification of QuickSort:

def findDuplicate(arr):
    orig_len = len(arr)
    if orig_len <= 1:
        return None
    pivot = arr.pop(0)
    greater = [i for i in arr if i > pivot]
    lesser = [i for i in arr if i < pivot]
    if len(greater) + len(lesser) != orig_len - 1:
        return pivot
    else:
        return findDuplicate(lesser) or findDuplicate(greater)

It finds a duplicate in O(n log n), I think. It uses extra memory on the stack, but I believe it can be rewritten to use only one copy of the original data:

def findDuplicate(arr):
    orig_len = len(arr)
    if orig_len <= 1:
        return None
    pivot = arr.pop(0)
    greater = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] > pivot]
    lesser = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] < pivot]
    if len(arr):
        return pivot
    else:
        return findDuplicate(lesser) or findDuplicate(greater)

The list comprehensions that produce greater and lesser destroy the original with calls to pop(). If arr is not empty after removing greater and lesser from it, then there must be a duplicate and it must be pivot.

The code suffers from the usual stack overflow problems on sorted data, so either a random pivot or an iterative solution which queues the data is necessary:

def findDuplicate(full):
    import copy
    q = [full]
    while len(q):
        arr = copy.copy(q.pop(0))
        orig_len = len(arr)
        if orig_len > 1:
            pivot = arr.pop(0)
            greater = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] > pivot]
            lesser = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] < pivot]
            if len(arr):
                return pivot
            else:
                q.append(greater)
                q.append(lesser)
    return None

However, now the code needs to take a copy of each list at the top of the loop, changing the memory requirements.

So much for computer science. The naive algorithm clobbers my code in Python, probably because Python's sort is implemented in C:

def findDuplicate(arr):
    arr = sorted(arr)
    prev = arr.pop(0)
    for element in arr:
        if element == prev:
            return prev
        else:
            prev = element
    return None

Upvotes: 6

lavinio

Reputation: 24319

The question is a little ambiguous; when the request is "which one," does it mean return the value that is duplicated, or the position in the sequence of the duplicated one? If the former, any of the following three solutions will work; if it is the latter, the first is the only one that will help.

Solution #1: assumes array is immutable

Build a bitmap; set the nth bit as you iterate through the array. If the bit is already set, you've found a duplicate. It runs in linear time and will work for any size array.

The bitmap would be created with as many bits as there are possible values in the array. As you iterate through the array, you check the nth bit in the bitmap. If it is set, you've found your duplicate; if it isn't, set it. (Logic for doing this can be seen in the pseudo-code in the Wikipedia entry on bit arrays, or use the System.Collections.BitArray class.)
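A Python sketch of the bitmap idea (the 1,000,000 bound is taken from the question; a bytearray stands in for the bit array):

```python
def find_duplicate_bitmap(arr, max_value=1000000):
    # One bit per possible value: 1,000,000 values -> about 122 KB.
    bits = bytearray(max_value // 8 + 1)
    for x in arr:
        byte, mask = x // 8, 1 << (x % 8)
        if bits[byte] & mask:  # nth bit already set: x is the duplicate
            return x
        bits[byte] |= mask
    return None
```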

Solution #2: assumes array is mutable

Sort the array, and then do a linear search until the current value equals the previous value. Uses the least memory of all. Bonus points for altering the sort algorithm to detect the duplicate during a comparison operation and terminating early.

Solution #3: (assumes array length = 1,000,001)

  1. Sum all of the integers in the array.
  2. From that, subtract the sum of the integers 1 through 1,000,000 inclusive.
  3. What's left will be your duplicated value.

This takes almost no extra memory and can be done in one pass if you calculate the sums at the same time.

The disadvantage is that you need to do the entire loop to find the answer.

The advantages are simplicity, and a high probability it will in fact run faster than the other solutions.
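Solution #3 in Python (a sketch; it assumes the array holds every integer from 1 to n plus exactly one repeat, so the expected sum is n(n+1)/2):

```python
def find_duplicate_sum(arr):
    # len(arr) == n + 1 because exactly one value is repeated.
    n = len(arr) - 1
    return sum(arr) - n * (n + 1) // 2
```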

Upvotes: 34

Luka Rahne

Reputation: 10507

Sort the integers by putting each one in the place where it should be. If you get a "collision", you have found the duplicate.

Space complexity is O(1) (the same space is just overwritten), and time complexity is less than O(n) swaps on average, because statistically you will hit the collision before reaching the end.
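One way to read this answer as Python (a sketch; it assumes the values are 1..n and the array may be overwritten):

```python
def find_duplicate_inplace(arr):
    # Put value v at index v-1 by swapping; a swap whose target slot
    # already holds the same value has hit the duplicate.
    i = 0
    while i < len(arr):
        v = arr[i]
        if v == i + 1:
            i += 1
        elif arr[v - 1] == v:
            return v  # collision: v already sits where it belongs
        else:
            arr[i], arr[v - 1] = arr[v - 1], arr[i]
    return None
```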

Upvotes: 0

Andrew Johnson

Reputation: 13286

from functools import reduce  # reduce lives in functools in Python 3

def singleton(array):
    # XOR of all elements: pairs cancel (A ^ A == 0), so XOR-ing this
    # result with 1 ^ 2 ^ ... ^ 1,000,000 leaves the duplicated value,
    # assuming every value in that range appears in the array.
    return reduce(lambda x, y: x ^ y, array)

Upvotes: 0

mjc

Reputation:

And how about the problem of finding ALL duplicates? Can this be done in less than O(n ln n) time? (Sort & scan) (If you want to restore the original array, carry along the original index and reorder after the end, which can be done in O(n) time)
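For the all-duplicates variant, a hash-based count gives expected O(n) time at the cost of O(n) extra memory (a sketch, not an answer to the in-place question):

```python
from collections import Counter

def all_duplicates(arr):
    # Count every value once (O(n)), then keep those seen more than once.
    return [v for v, c in Counter(arr).items() if c > 1]
```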

Upvotes: 0

Mike Mu

Reputation: 1007

Hint: Use the property that A XOR A == 0, and 0 XOR A == A.
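Spelled out as a sketch (assuming, as in the sum-based answers, that the array holds 1..n plus one repeat):

```python
def find_duplicate_xor(arr):
    # A ^ A == 0 and 0 ^ A == A, so XOR-ing the array together with the
    # full range 1..n cancels every value that appears once,
    # leaving only the repeated value.
    n = len(arr) - 1
    acc = 0
    for x in arr:
        acc ^= x
    for v in range(1, n + 1):
        acc ^= v
    return acc
```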

Upvotes: 2

Jay

Reputation: 9656

As a variant of your solution (2), you can use radix sort. No extra memory, and it will run in linear time. You can argue that time is also affected by the size of the number representation, but you have already given bounds for that: radix sort runs in O(kn) time, where k is the number of digits sorted at each pass. That makes the whole algorithm O(7n) for sorting plus O(n) for checking for the duplicated number, which is O(8n) = O(n).

Pros:

  • No extra memory
  • O(n)

Cons:

  • Need eight O(n) passes.
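A least-significant-digit radix sort along these lines might look like the following (a sketch; note that this naive Python version allocates per-pass buckets, trading the no-extra-memory claim for simplicity):

```python
def radix_sort(arr, digits=7, base=10):
    # One pass per decimal digit: 7 passes cover values up to 9,999,999.
    for d in range(digits):
        buckets = [[] for _ in range(base)]
        for x in arr:
            buckets[x // base ** d % base].append(x)
        arr = [x for b in buckets for x in b]
    return arr

def find_duplicate_radix(arr):
    # Sort with the k radix passes, then one linear scan for a repeat.
    s = radix_sort(arr)
    for prev, cur in zip(s, s[1:]):
        if prev == cur:
            return cur
    return None
```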

Upvotes: 0

rampion

Reputation: 89123

Assuming all the numbers from 1 to 1,000,000 are in the array, the sum of all numbers from 1 to 1,000,000 is (1,000,000)*(1,000,000 + 1)/2 = 500,000 * 1,000,001 = 500,000,500,000.

So just add up all the numbers in the array, subtract 500,000,500,000, and you'll be left with the number that occurred twice.

O(n) time, and O(1) memory.

If the assumption isn't true, you could try using a Bloom filter - they can be stored much more compactly than a hash table (since they only store the fact of presence), but they do run the risk of false positives. This risk can be bounded, though, by choosing how much memory to spend on the Bloom filter.

We can then use the bloom filter to detect potential duplicates in O(n) time and check each candidate in O(n) time.
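A toy Bloom filter in Python might look like this (a sketch: the sizes, the double-hashing scheme, and the final recheck are all illustrative choices, not part of the answer):

```python
import hashlib

class BloomFilter:
    # m bits, k probe positions derived from one SHA-256 digest.
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def candidate_duplicates(arr):
    # Values flagged by the filter are only *candidates*: each must be
    # re-checked, because of possible false positives.
    bf, candidates = BloomFilter(), []
    for x in arr:
        if x in bf:
            candidates.append(x)
        bf.add(x)
    return [c for c in candidates if arr.count(c) > 1]
```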

Upvotes: 10

T .

Reputation: 4944

Rather than sorting the array and then checking, I would suggest writing an implementation of a comparison sort that exits as soon as the duplicate is found. That needs no extra memory (depending on the algorithm you choose, obviously) and keeps the worst case at O(n log n) time (again, depending on the algorithm), while the best and average cases can terminate much earlier.

E.g. An implementation of in-place merge sort.

http://en.wikipedia.org/wiki/Merge_sort
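One way to sketch that early exit (a top-down merge sort rather than the in-place variant, so it does use O(n) working memory, but it shows the idea of terminating during a comparison):

```python
def merge_find_dup(arr):
    # Merge sort that stops as soon as a merge step compares two equal
    # values; returns (sorted_run, duplicate_or_None).
    if len(arr) <= 1:
        return arr, None
    mid = len(arr) // 2
    left, d = merge_find_dup(arr[:mid])
    if d is not None:
        return left, d
    right, d = merge_find_dup(arr[mid:])
    if d is not None:
        return right, d
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            return merged, left[i]  # duplicate found: exit early
        if left[i] < right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    return merged, None

def find_duplicate_merge(arr):
    return merge_find_dup(arr)[1]
```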

Upvotes: 2
