Atishay

Reputation: 1056

Find duplicates in an array, without using any extra space

Given an array of n integers, how will you find whether there are duplicates in the array in O(n) time, without using any extra space?

By extra space, I mean extra space of order O(n).

Does the XOR operator help in any way?

Upvotes: 28

Views: 54304

Answers (14)

stanm

Reputation: 3321

To kickstart our visual thinking, let's use an example array: 50 100 150 -2 -1 0 1 2 3 4. As you can surely tell, it has no duplicates, so our algorithm should output FALSE. Also, its length is 10.

Step A: Count in O(N) time

Let's ignore the extra memory constraint for now (actually, violate it really badly by assuming we can have infinite additional memory :) and save, in a fictional infinite array (it is also doubly infinite, since it allows negative indices too), the counts for each integer. For our input, this array would look like this:

...000001111111000...00100...00100...001000000...
        ^              ^               ^
   [index  -2]     [index  50]     [index 150]

If any element of this counting array is greater than 1, then we have a duplicate and the algorithm should return TRUE.

Step B: Map -inf..inf to 0..N in O(N) time

Let's assume we have a map f(x): -inf..inf -> 0..N that can compress our infinite array to an array of size N, and furthermore do it in O(N) time. This is what hashing ideally does. Note that we don't care about maintaining the order of the array, since we only care whether it has counts above 1. Thus, we can combine these two steps and eliminate the need for infinite memory - yay! We are still using an additional O(N) memory (in fact, exactly N counts) to keep the count values. The next step will get rid of that.
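
For concreteness, here is a minimal Java sketch of steps A+B combined, with a hash map standing in for the compressed array of counts (still the O(N) extra memory we are about to get rid of):

    import java.util.HashMap;
    import java.util.Map;

    // Steps A+B in one pass: hash each value, bump its count, and stop
    // as soon as any count exceeds 1. Uses O(N) extra memory.
    static boolean hasDuplicate(int[] a) {
        Map<Integer, Integer> count = new HashMap<>();
        for (int x : a) {
            if (count.merge(x, 1, Integer::sum) > 1)
                return true;  // count above 1: x was seen before
        }
        return false;
    }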

Step C: Using first element as a switch

Before I explain this step, notice that we don't really need to store any counts greater than 1. The first time we want to increase a counter and notice it already holds 1, we know we have found a duplicate! So 1 bit of memory per counter is enough; this shrinks each counter from O(lg(N)) bits to 1 bit, but the total is still not good enough. The important part is simply that 1 bit of memory per counter is enough.
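
To illustrate, if we had the ideal map f from step B (which, as the last section admits, we don't), a bit set of N bits would implement these one-bit counters. This is a sketch only, and it still uses Θ(N) bits of extra memory:

    import java.util.BitSet;
    import java.util.function.IntUnaryOperator;

    // One bit per counter: the first time a bit is already set, we have
    // found a duplicate. `f` stands in for the hypothetical perfect map
    // into 0..N-1.
    static boolean hasDuplicate(int[] a, IntUnaryOperator f) {
        BitSet seen = new BitSet(a.length);
        for (int x : a) {
            int slot = f.applyAsInt(x);
            if (seen.get(slot)) return true;  // counter already at 1
            seen.set(slot);
        }
        return false;
    }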

We are now going to exploit the fact that we can modify our input array. We go over the array and xor all the elements with the value of the first element. If the result is less than the value before the operation, we change the element to that result. We also store the first element separately as sw, at an additional O(1) memory cost.

Now we can use the stored first element sw and the transformed array to encode the counts from the counting step (steps A+B) in the following manner: considering the element with index k of A, if A[f(A[k])] < A[f(A[k])] xor sw, then the count is zero, meaning the element we are considering - A[k] - has not been seen before, so we change A[f(A[k])] to A[f(A[k])] xor sw. If, on the other hand, A[f(A[k])] > A[f(A[k])] xor sw, then the count is one, meaning A[k] has already been seen before, so it is a duplicate.

Assuming the map:

f(-2 xr 50) -> 0
f(-1 xr 50) -> 1
f(0)        -> 2
f(1)        -> 3
f(2)        -> 4
f(3)        -> 5
f(4)        -> 6
f(86)       -> 7
f(150)      -> 8
f(1337)     -> 9

and after performing the steps in the following order (step C, then steps A+B), the input array looks like this:

50(0) 100(86) 150(164) -2(-2 xr 50) -1(-1 xr 50) 0(50) 1(51) 2(48) 3(49) 4(54) [intermediate state, not stored in memory]
0     86      150      -2 xr 50     -1 xr 50     0     1     2     3     4     [state after step c]
0     86     *164*     -2 xr 50     -1 xr 50     0     1     2     3     4     [counted element 0]
0     86      164      -2 xr 50     -1 xr 50     0     1    *48*   3     4     [counted element 1]
0     86      164      -2 xr 50     -1 xr 50     0     1     48   *49*   4     [counted element 2]
*50*  86      164      -2 xr 50     -1 xr 50     0     1     48    49    4     [counted element 3]
50   *100*    164      -2 xr 50     -1 xr 50     0     1     48    49    4     [counted element 4]
50    100    !164!     -2 xr 50     -1 xr 50     0     1     48    49    4     [counted element 5]

Trying to count the element with index 5, which is 0, we see that there already was a 0 in the array (because A[f(A[5])] is 164, which is greater than 164 xr 50), so we output TRUE and the algorithm ends.

Moral of the story

If we are not allowed enough memory × time, we are bound to forget something and make a mistake.

Sorry

Unfortunately, we don't have a perfect hash function, and we cannot just create memory out of thin air, so a traditional approach would not work under the required constraints. The algorithm that the OP's answer points to might be modified so that it allows numbers which, interpreted as array indices, would fall outside the bounds of the array, given a perfect hash function. But even then, one would still have to invent a way to use it to detect duplication, rather than to find a duplicate guaranteed to exist...

Upvotes: 0

Aditya

Reputation: 1068

You can use cyclic sort (swap sort) to do all the necessary swaps and eventually find the duplicate elements. You can also scale this solution if you want to find more than one duplicate element in the same array. One caveat with this solution is that the range of the numbers in the array has to be from 1 to n.

Below is sample code for this. For the input [1,3,4,2,2], the duplicate element is 2; you can run this code on that input.

public int findDuplicate(int[] nums) {
    int i = 0;
    // cyclic sort: keep swapping until value v sits at index v - 1
    while (i < nums.length) {
        if (nums[i] != nums[nums[i] - 1]) {
            swap(i, nums[i] - 1, nums);
        } else {
            i++;
        }
    }
    // any index whose value is out of place holds a duplicate
    for (int j = 0; j < nums.length; j++) {
        if (nums[j] != j + 1) {
            return nums[j];
        }
    }
    return -1;
}

public static void swap(int i, int j, int[] nums) {
    int temp = nums[i];
    nums[i] = nums[j];
    nums[j] = temp;
}

The main invariant to keep in mind is that index i should hold the number i + 1: index 0 should hold element 1 (arr[0] = 1), index 1 should hold element 2 (arr[1] = 2), and so forth.

This solution doesn't require any extra space, and the time complexity is O(n).

Upvotes: -1

Emanuele

Reputation: 1

I would like to put the following solution on the table; I hope it helps.

assert(A.size() > 1);

// XOR-fold the array: values occurring an even number of times cancel
// out, so this returns the value that occurs an odd number of times.
// Note: this does not detect duplicates in general.
int m = A[0];
for (unsigned int i = 1; i < A.size(); ++i) {  // O(n)
    m ^= A[i];
}
return m;

It is based on: https://codesays.com/2015/solution-to-odd-occurrences-in-array-by-codility/

Cheers

Upvotes: 0

smaiakov

Reputation: 480

A clean example that determines duplicates in O(n) time and O(1) space:

public class DuplicateDetermineAlgorithm {
    public static boolean isContainsDuplicate(int[] array) {
        if (array == null) {
            throw new IllegalArgumentException("Input array can not be null");
        }
        if (array.length < 2) {
            return false;
        }

        // mark array[|value| - 1] as seen by flipping its sign;
        // this assumes every |value| lies in 1..array.length
        for (int i = 0; i < array.length; i++) {
            int pointer = convertToPositive(array[i]) - 1;
            if (array[pointer] > 0) {
                array[pointer] = changeSign(array[pointer]);
            } else {
                return true;
            }
        }
        return false;
    }

    private static int convertToPositive(int value) {
        return value < 0 ? changeSign(value) : value;
    }

    private static int changeSign(int value) {
        return -1 * value;
    }
}
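
A quick usage sketch (my own example inputs; note the method implicitly assumes every value's absolute value falls in 1..array.length, or pointer would be an invalid index):

    public class Demo {
        public static void main(String[] args) {
            System.out.println(DuplicateDetermineAlgorithm
                    .isContainsDuplicate(new int[]{1, 3, 4, 2, 2})); // true
            System.out.println(DuplicateDetermineAlgorithm
                    .isContainsDuplicate(new int[]{1, 3, 4, 2, 5})); // false
        }
    }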

Upvotes: 1

AndyG

Reputation: 41230

In-place Radix Sort followed by Linear Scan

In place radix sort algorithm

Depending on what you consider the time complexity of a radix sort to be, this solution is O(N) time, though I personally don't think so. I think that if you don't make the linear-time assumption on the integer sort, then the problem is unsolvable.

Due to the fact that the sort is in-place, there's only O(1) additional storage required.

Code is all C++11

Step 1: Radix Sort in place

template<typename T, typename std::enable_if<std::is_integral<T>::value>::type* = nullptr>
void RecurseOnRadixSort(std::vector<T>& myArray, T mask, int zerosEnd, int onesBegin)
{
    if (zerosEnd+1 >= onesBegin-1 || mask == 0) 
        return;

    int zerosEnd2 = zerosEnd;
    int onesBegin2 = onesBegin;
    while(zerosEnd2+1 <= onesBegin2-1)
    {
        // swap ones to the right
        if ((myArray[zerosEnd2+1] & mask) != 0)
        {
            std::swap(myArray[zerosEnd2+1], myArray[onesBegin2-1]);
            --onesBegin2;
        }
        else
            ++zerosEnd2;
    }

    mask >>= 1;

    //recurse on lhs
    RecurseOnRadixSort(myArray, mask, zerosEnd, zerosEnd2+1);

    //recurse on rhs
    RecurseOnRadixSort(myArray, mask, onesBegin2-1, onesBegin);
}

template <typename T, typename std::enable_if<std::is_integral<T>::value>::type* = nullptr>
void InPlaceRadixSort(std::vector<T>& myArray)
{
    int zerosEnd = -1;
    int onesBegin = static_cast<int>(myArray.size());
    T mask = static_cast<T>(1) << sizeof(T)*8-1;
    while(zerosEnd+1 <= onesBegin-1)
    {
        if ( (myArray[zerosEnd+1] & mask) != 0)
        {
            std::swap(myArray[zerosEnd+1], myArray[onesBegin-1]);
            --onesBegin;
        }
        else
            ++zerosEnd;
    }

    mask = static_cast<T>(1) << sizeof(T)*8-2; // need to reassign in case of signed datatype
    //recurse on lhs
    RecurseOnRadixSort(myArray, mask, -1, zerosEnd+1);
    //recurse on rhs
    RecurseOnRadixSort(myArray, mask, onesBegin-1, static_cast<int>(myArray.size()));

    // swap negatives to the front
    auto iterSmallest = std::min_element(myArray.begin(), myArray.end());
    if (*iterSmallest < 0)
    {
        std::reverse(myArray.begin(), myArray.end());
        iterSmallest = std::min_element(myArray.begin(), myArray.end());
        std::reverse(myArray.begin(), iterSmallest+1);
        std::reverse(iterSmallest+1, myArray.end());
    }
}

Step 2: Linear scan for duplicate elements

for (size_t i=0, j=1; j<myArray.size(); ++i,++j)
{
    if (myArray[i] == myArray[j])
    {
        std::cout << "Found duplicate element " << myArray[i];
    }
}

Complete Code

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <ctime>
#include <type_traits>
using namespace std;
#define N 10

template <typename T>
void PrintArray(const std::vector<T>& myArray)
{
    for (auto&& element : myArray)
    {
        std::cout << element << std::endl;
    }
}

template<typename T, typename std::enable_if<std::is_integral<T>::value>::type* = nullptr>
void RecurseOnRadixSort(std::vector<T>& myArray, T mask, int zerosEnd, int onesBegin)
{
    if (zerosEnd+1 >= onesBegin-1 || mask == 0) 
        return;

    int zerosEnd2 = zerosEnd;
    int onesBegin2 = onesBegin;
    while(zerosEnd2+1 <= onesBegin2-1)
    {
        // swap ones to the right
        if ((myArray[zerosEnd2+1] & mask) != 0)
        {
            std::swap(myArray[zerosEnd2+1], myArray[onesBegin2-1]);
            --onesBegin2;
        }
        else
            ++zerosEnd2;
    }

    mask >>= 1;

    //recurse on lhs
    RecurseOnRadixSort(myArray, mask, zerosEnd, zerosEnd2+1);

    //recurse on rhs
    RecurseOnRadixSort(myArray, mask, onesBegin2-1, onesBegin);
}

template <typename T, typename std::enable_if<std::is_integral<T>::value>::type* = nullptr>
void InPlaceRadixSort(std::vector<T>& myArray)
{
    int zerosEnd = -1;
    int onesBegin = static_cast<int>(myArray.size());
    T mask = static_cast<T>(1) << sizeof(T)*8-1;
    while(zerosEnd+1 <= onesBegin-1)
    {
        if ( (myArray[zerosEnd+1] & mask) != 0)
        {
            std::swap(myArray[zerosEnd+1], myArray[onesBegin-1]);
            --onesBegin;
        }
        else
            ++zerosEnd;
    }

    mask = static_cast<T>(1) << sizeof(T)*8-2; // need to reassign in case of signed datatype
    //recurse on lhs
    RecurseOnRadixSort(myArray, mask, -1, zerosEnd+1);
    //recurse on rhs
    RecurseOnRadixSort(myArray, mask, onesBegin-1, static_cast<int>(myArray.size()));

    // swap negatives to the front
    auto iterSmallest = std::min_element(myArray.begin(), myArray.end());
    if (*iterSmallest < 0)
    {
        std::reverse(myArray.begin(), myArray.end());
        iterSmallest = std::min_element(myArray.begin(), myArray.end());
        std::reverse(myArray.begin(), iterSmallest+1);
        std::reverse(iterSmallest+1, myArray.end());
    }
}

int main() {
    srand(time(NULL));
    std::vector<int> myArray(N);
    for (size_t i=0;i<myArray.size();++i)
    {
        myArray[i] = rand() % 100 * (rand() % 2 == 1?-1:1);
    }

    std::cout << "Vector before radix sort: " << std::endl;
    PrintArray(myArray);
    InPlaceRadixSort(myArray);
    std::cout << "Vector after radix sort: " << std::endl;
    PrintArray(myArray);

    for (size_t i=0, j=1; j<myArray.size(); ++i,++j)
    {
        if (myArray[i] == myArray[j])
        {
            std::cout << "Found duplicate element " << myArray[i];
        }
    }
    return 0;
}

Live Demo

Upvotes: 7

amit

Reputation: 178521

If there is no additional information, this question seems to be unsolvable, as this is the Element Distinctness Problem, which cannot be solved under the restrictions you provided in the required time.

You can allow either:

(1) more memory, and use a hashtable/hashset to meet the O(n) time criterion [iterate over the array, check if an element is in the hash table; if it is, you have dupes; otherwise insert the element into the table and continue], or

(2) more time, and sort the array [O(nlogn)] to meet the sub-linear space criterion [after sorting, iterate over the array, and for each a[i], a[i+1], check whether they are identical; if you never find an identical pair, you have no dupes].
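
Minimal Java sketches of both relaxations (method names are mine):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // (1) O(n) time, O(n) extra space: hash-set membership test.
    static boolean hasDupWithHashing(int[] a) {
        Set<Integer> seen = new HashSet<>();
        for (int x : a)
            if (!seen.add(x)) return true;  // add() returns false on a repeat
        return false;
    }

    // (2) O(n log n) time, in-place: sort, then compare neighbours.
    static boolean hasDupWithSorting(int[] a) {
        Arrays.sort(a);
        for (int i = 1; i < a.length; i++)
            if (a[i] == a[i - 1]) return true;
        return false;
    }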

EDIT: The proof of this claim is a bit lengthy and needs mathematical notation that is not supported here (sidenote: we really need TeX support). The idea is that if we model our problem as an Algebraic Computation Tree (which is a fair assumption when no hashing is allowed and we have constant space at our disposal), then Ben-Or proved in his article Lower Bounds for Algebraic Computation Trees (1983) (published by the prestigious ACM) that element distinctness is an Omega(nlogn) problem under this model. Lubiw showed in 1991 that the same conclusion applies when we limit ourselves to integers: A Lower Bound for the Integer Element Distinctness Problem. So under the algebraic computation tree model, the integer element distinctness problem is an Omega(nlogn) problem.

Upvotes: 29

Slizzered

Reputation: 899

For the general case, this problem doesn't seem to have a solution, due to the strong complexity constraints and the unrestricted input.

It is clear that you need at least N steps to even SEE all the input, so you cannot be faster than O(n).

Now, to make sure you spot every possible duplicate, you have several possibilities:

  • Compare each number with every other number. This doesn't require much additional space, but takes O(n^2) time (see the sketch after this list).
  • Do the comparison in a smarter way, by swapping the integers around in the available space. This allows you to "store information" in the sequence itself. Actually, comparing all numbers with each other is what sorting algorithms usually do. The fastest known sorting algorithms that don't require additional space need O(n log n) time; Wikipedia has a rather lengthy write-up with lots of sources (some comparison chart of known sorting algorithms). So you can never meet the time requirement that way.
  • You could do some bookkeeping with a hash map that might allow you to take only linear time O(n), but that bookkeeping needs to be stored somewhere. Otherwise, you will simply "forget" which numbers you have already seen. Unfortunately, the bookkeeping will require more space as your input grows, because you have that many more distinct numbers to remember. So it is impossible to have the same fixed amount of memory and compare arbitrarily long input sequences. Therefore, you would have to violate the constant-space requirement O(1).
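
For reference, the quadratic baseline from the first bullet could look like this in Java (a minimal sketch):

    // O(1) extra space, O(n^2) time: compare every pair exactly once.
    static boolean hasDuplicateQuadratic(int[] a) {
        for (int i = 0; i < a.length; i++)
            for (int j = i + 1; j < a.length; j++)
                if (a[i] == a[j]) return true;
        return false;
    }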

As @Atishay points out in his answer, there can be a solution if you have a very restricted input: an array of size n whose possible values are only in the range [0, n-2]. This requirement guarantees that there MUST be a duplicate somewhere, because there are fewer distinct values than elements in the array (pigeonhole). With that knowledge and the very specific range of values, you can do it. But this uses very narrow assumptions and does not solve the general problem stated in the question.

Edit

As clarified in the comments, there is a proven Omega(n log n) lower bound on the time complexity of comparison-based sorting algorithms.

Upvotes: 2

Alex

Reputation: 13244

This solution is based upon one that removes duplicates from an array by @dsimcha, as can be found here.

It performs an in-place swapping algorithm, with value hashes used to swap positions. Note that this destroys the original array content to some extent, but nothing in the OP's question forbade that.

public static class DupFinder
{
    public static bool HasDups(int[] array, ref int nEvals)
    {
        nEvals = 0;
        return DupFinder.FindInPlace(array, 0, ref nEvals);
    }

    private static bool FindInPlace(int[] array, int start, ref int nEvals)
    {
        if (array.Length - start < 2)
            return false;

        var sentinel = array[start];
        var offset = start + 1;
        var len = array.Length - offset;
        for (var ndx = 0; ndx < len; nEvals++)
        {
            var cur = array[offset + ndx];
            if (cur == sentinel)
            {
                ndx++;
                continue;
            }

            var hash = cur % len;
            if (ndx == hash)
            {
                ndx++;
                continue;
            }

            var at_hash = array[offset + hash];
            if (cur == at_hash)
            {
                array[offset + ndx] = sentinel;
                ndx++;
                continue;
            }

            if (at_hash == sentinel)
            {
                Swap(array, offset, ndx, hash);
                ndx++;
                continue;
            }

            var hash_hash = at_hash % len;
            if (hash_hash != hash)
            {
                Swap(array, offset, ndx, hash);
                if (hash < ndx)
                    ndx++;
            }
            else
            {
                ndx++;
            }
        }

        var swapPos = 0;
        for (var i = 0; i < len; i++, nEvals++)
        {
            var cur = array[offset + i];
            if (cur != sentinel && i == (cur % len))
                Swap(array, offset, i, swapPos++);
        }

        for (var i = swapPos; i < len; nEvals++)
        {
            var cur = array[offset + i];
            if (cur == sentinel)
                return true; // got dups.
            else
                i++;
        }

        // Let's assume C# supports tail recursion ;-)
        // Then => look ma, O(1) extra storage space.
        return FindInPlace(array, offset + swapPos, ref nEvals);
    }

    private static void Swap(int[] array, int offset, int first, int second)
    {
        var tmp = array[offset + first];
        array[offset + first] = array[offset + second];
        array[offset + second] = tmp;
    }
}

Thus, if we assume for a moment that C# supports tail recursion and we don't count the used stack frames as extra space, it has O(1) space requirements.

The original author suggests O(N)-ish time complexity. The (limited) tests I've performed (as opposed to a computational complexity analysis) indicate it is closer to O(N log N).

Array Size   Dup Position    #Evals
12           7               26
12           -               35
100,000      80,000          279,997
100,000      -               453,441

Upvotes: 2

user846316

Reputation: 6331

// Note: this approach uses O(n) extra space (the status array)
// and O(n^2) time.
public static void getDuplicatesElements (Integer arr[]){

    //Status array to track the elements if they are already considered
    boolean status[] = new boolean [arr.length];

    //Flag to mark the element found its duplicate
    boolean dupFlag = false;

    //Output string
    String  output = "";

    //Count of duplicate elements found
    int count = 0;

    //Initialize status array with all false i.e. no duplicates
    for (int i = 0; i < arr.length; i++)
    {
        status[i] = false;
    }

    //first loop to check every element
    for (int i = 0; i < arr.length - 1; i++)
    {
        //Initialize every element to no duplicate
        dupFlag = false;

        //Check if this element is not already found duplicate, if not, check now.
        if (!status[i]){
            for (int j = i+1; j <  arr.length; j++){
                // use equals(): == on boxed Integers compares references
                if (arr[i].equals(arr[j])){
                    dupFlag = true;
                    status[j] = true;
                }
            }
        }

        if (dupFlag){
            output = output + " " + arr[i];
            count++;
        }
    }

    System.out.println("Duplicate elements: " + output );
    System.out.println("Count: " + count );

}

Upvotes: 0

Baz1nga

Reputation: 15579

An implementation using a single int as a temporary variable; this uses a bit vector. It assumes the input characters are lowercase letters 'a'..'z' (at most 32 distinct values, so they fit in one int):

public static boolean isUniqueChars(String str) {
    int checker = 0;
    for (int i = 0; i < str.length(); ++i) {
        int val = str.charAt(i) - 'a';  // assumes lowercase 'a'..'z'
        if ((checker & (1 << val)) > 0) return false;
        checker |= (1 << val);
    }
    return true;
}

Or my previous O(n^2) implementation, without using any temp variable:

public static boolean isDuplicate(char[] str) {
    if (str == null) return false;
    int len = str.length;
    if (len < 2) return false;

    for (int i = 1; i < len; ++i) {
        // compare only against earlier elements; comparing with j == i
        // would always report a duplicate
        for (int j = 0; j < i; ++j) {
            if (str[i] == str[j]) return true;
        }
    }
    return false;
}

Upvotes: 1

user2808380

Reputation: 5

import java.util.HashSet;
import java.util.Set;

public class FindDups {
    public static void main(String[] args) {
        int a[] = {1, 2, 3, 3, 4};

        // note: the set itself is O(n) extra space
        Set<Integer> s = new HashSet<Integer>();
        for (int i = 0; i < a.length; i++) {
            if (!s.add(a[i]))
                System.out.println("at index " + i + " " + a[i] + " is duplicate");
        }
        for (int i : s) {
            System.out.println(i);
        }
    }
}

Upvotes: -2

Abhinav Jain

Reputation: 151

Here is a solution with O(n) time usage and O(1) space usage!

Traverse the array. For every index i of A[]:
{
    check the sign of A[abs(A[i])];
    if positive, make it negative:   A[abs(A[i])] = -A[abs(A[i])];
    else  // i.e., A[abs(A[i])] is already negative
        this element (the ith element of the list) is a repetition
}

Credits: Method 5, GeeksforGeeks
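
A direct Java sketch of the above (my own translation; it assumes all values lie in 1..n-1, since a value of 0 would defeat the sign trick and anything out of range would index past the array):

    // Sign-marking sketch: visiting value v flips the sign at index v
    // to record "seen". Assumes all values are in 1..n-1.
    static boolean hasRepetition(int[] a) {
        for (int i = 0; i < a.length; i++) {
            int idx = Math.abs(a[i]);   // recover the original value
            if (a[idx] > 0) {
                a[idx] = -a[idx];       // first sighting: mark it
            } else {
                return true;            // already marked: repetition
            }
        }
        return false;
    }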

Upvotes: 2

Atishay

Reputation: 1056

Here's an interesting solution to this problem, with the single constraint that the elements must range from 0 to n-2 (inclusive), where n is the number of elements.

This works in O(n) time with an O(1) space complexity.
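
One way to achieve those bounds under that constraint is Floyd's cycle detection over the successor function i -> A[i] + 1 (a sketch of my own, not necessarily the solution linked above; with values in 0..n-2, the +1 shift keeps every step inside the array, and index 0 is never re-entered, so the walk from 0 forms a rho shape whose cycle entry is duplicate + 1):

    // Floyd's tortoise-and-hare on the functional graph i -> a[i] + 1.
    // Two indices holding the duplicated value v both point at node
    // v + 1, so the cycle entry reached from 0 is v + 1.
    // Assumes n elements with values in 0..n-2 (pigeonhole guarantees
    // a duplicate exists).
    static int findDuplicate(int[] a) {
        int slow = 0, fast = 0;
        do {                              // phase 1: find a meeting point
            slow = a[slow] + 1;
            fast = a[a[fast] + 1] + 1;
        } while (slow != fast);

        slow = 0;                         // phase 2: locate the cycle entry
        while (slow != fast) {
            slow = a[slow] + 1;
            fast = a[fast] + 1;
        }
        return slow - 1;                  // undo the +1 shift
    }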

Upvotes: 4

phunctor

Reputation: 609

A Bloom filter is a space-efficient hash set with a tunable false-positive rate. The possibility of false positives means you have to go back and check for a real duplicate when you get a hit from the BF, which introduces an N^2 term, but with a coefficient of ~exp(-(extra space used for the filter)). This yields an interesting space-vs-time tradeoff.

I don't have a proof the question as posed is insoluble, but in general "here's an interesting tradeoff space" is a good answer to an insoluble problem.
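
A toy Java sketch of the idea (hash functions and filter size are illustrative, not tuned): on a filter hit we re-scan the prefix to distinguish a real duplicate from a false positive, which is where the N^2 term with the small coefficient comes from.

    import java.util.BitSet;

    public class BloomDupCheck {
        // Two cheap multiplicative hashes into the filter's bit range.
        static boolean hasDuplicate(int[] a, int filterBits) {
            BitSet filter = new BitSet(filterBits);
            for (int i = 0; i < a.length; i++) {
                int h1 = Math.floorMod(a[i] * 0x9E3779B9, filterBits);
                int h2 = Math.floorMod(a[i] * 0x85EBCA6B + 1, filterBits);
                if (filter.get(h1) && filter.get(h2)) {
                    for (int j = 0; j < i; j++)   // verify: dup or false positive?
                        if (a[j] == a[i]) return true;
                }
                filter.set(h1);                   // record this element
                filter.set(h2);
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(hasDuplicate(new int[]{5, 7, 9, 7}, 1 << 10));  // true
            System.out.println(hasDuplicate(new int[]{5, 7, 9, 11}, 1 << 10)); // false
        }
    }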

Upvotes: 2
