Reputation: 5347
I have two sets, A and B.
Set A has 100 million numbers (each number is 64-bit).
Set B has 100 million numbers (each number is 64-bit).
The two sets are the same size.
The data is all random and not sorted.
What algorithm would you recommend to find the duplicate numbers between the two sets?
(I can use about 4 GB of memory and 100~200 GB of disk space.)
Thank you in advance.
Upvotes: 0
Views: 590
Reputation: 241701
Probably the cheapest in terms of execution time (but not, unfortunately, programming time) is to put the elements from A into an open-addressed hash table and then look each element of B up in the hash table. If you can come up with a reasonable hash function (more below) then you can use simple linear hashing with a load factor of about 60%, which means your table will occupy 10^8 * (1/0.6) * 8 bytes, or about 1.3 GB. (I don't know of a language which offers open-addressed hash tables in the standard library; C++'s unordered_set is implemented with bucket chains, which would add only slightly more overhead if the individual elements weren't individual storage allocations. A good allocator might make this feasible.)
Fortunately, open-addressed linear-probed hash-tables are pretty easy to write, particularly if you don't need to deal with deleting elements. You only have two issues:
You need to reserve some value which means "unoccupied".
You need a good hash-function. Or at least a reasonable one.
If your data is truly randomly distributed in the 64-bit space, then hashing is simple; you just need to reduce the data to the desired size. A simple way of doing that would be to use the modulo operator, which should work well even if the data isn't totally randomly distributed provided you arrange to make the table size a prime (166666783 would be around the right size for 60% load factor with 100 million elements).
Finding a value which means "unoccupied" could be trickier. It's possible that you already know that one value is impossible (possibly the value 0). If not, you could just choose a random 64-bit number; the odds are pretty good that it's not present in your dataset, but if it is you have an easy backup: don't put it into the hash table, and check every value of B against it.
Pseudo-C++ code, based on the above description, including the "no value" hack mentioned:
#include <cstddef>
#include <cstdint>
#include <vector>

class HundredMillionSet {
  std::vector<uint64_t> h_;
  const size_t size_;
  const uint64_t no_value_;
  bool has_no_value_;

 public:
  HundredMillionSet(size_t size, uint64_t no_value)
      : h_(size, no_value), size_(size), no_value_(no_value), has_no_value_(false) {}

  // Insert v; the reserved "unoccupied" sentinel is tracked with a flag instead of a slot.
  void insert(uint64_t v) {
    if (v == no_value_) { has_no_value_ = true; }
    else {
      size_t i = v % size_;
      // Linear probing: walk forward (wrapping around) until an empty slot is found.
      while (h_[i] != no_value_) {
        if (++i == size_) i = 0;
      }
      h_[i] = v;
    }
  }

  // Return true if v was previously inserted.
  bool check(uint64_t v) const {
    if (v == no_value_) return has_no_value_;
    size_t i = v % size_;
    // Probe until we hit v or an empty slot.
    while (h_[i] != v && h_[i] != no_value_) {
      if (++i == size_) i = 0;
    }
    return h_[i] == v;
  }
};
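As a usage sketch (not part of the answer above): assuming A and B have already been read into two std::vector<uint64_t> buffers, a driver might look like this. The helper name, the table size, and the choice of 0 as the "unoccupied" value are illustrative assumptions.

// Hypothetical driver: count how many values of B also occur in A.
uint64_t count_duplicates(const std::vector<uint64_t>& a,
                          const std::vector<uint64_t>& b) {
  const size_t table_size = 166666783;  // prime, ~60% load factor for 10^8 keys
  const uint64_t no_value = 0;          // assumed sentinel; handled separately if it occurs in A
  HundredMillionSet table(table_size, no_value);
  for (uint64_t v : a) table.insert(v);
  uint64_t count = 0;
  for (uint64_t v : b) {
    if (table.check(v)) ++count;
  }
  return count;
}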
Upvotes: 2
Reputation: 46249
Since your entire dataset will fit (easily) into your RAM, you don't need to do anything clever with ordering it, and don't need to use the disk space (beyond loading the data in the first place).
I'll assume each element can exist at most once in each list.
Dumb (brute-force) approach, O(n^2):
foreach a in A (this could be as you read it from disk)
    foreach b in B
        if a is b
            increase count
            break out of inner loop
Pre-sorted approach: (2*n*log(n)+n), so O(n*log(n))
sort A
sort B
B_ind = 0
foreach a in A
    foreach b in B from B_ind
        if a is b
            increase count
            B_ind = current b index + 1
            break out of inner loop
        else if b > a
            B_ind = current b index
            break out of inner loop
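For reference, a rough C++ rendering of that pre-sorted approach (not from the answer; the helper name and in-place sorting are my assumptions) collapses the two loops into a single merge-style pass over the sorted arrays:

#include <algorithm>
#include <cstdint>
#include <vector>

// Sort both arrays in place, then walk them together counting matches.
uint64_t count_common_sorted(std::vector<uint64_t>& a, std::vector<uint64_t>& b) {
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  uint64_t count = 0;
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i] < b[j])      ++i;
    else if (b[j] < a[i]) ++j;
    else { ++count; ++i; ++j; }  // a[i] == b[j]: one shared value
  }
  return count;
}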
I'd recommend the former, but with parallel processing. It's easy to break up the outer loop into chunks to split between processors / workstations. The latter can also be parallelised somewhat (perform both sorts simultaneously) but nowhere near as much.
Also in the case of the former, you could get some performance boost from splitting the b loop into cache-sized chunks. i.e. check A[0] against B[0...1023] (if your cache can hold 1024 items), then check A[1], ... A[final], then check A[0] against B[1024...2047], etc.
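A rough sketch of that cache-blocked variant (the block size and names are illustrative, not from the answer; it assumes, as stated above, that each value occurs at most once per set):

#include <algorithm>
#include <cstdint>
#include <vector>

// For each cache-sized chunk of B, scan all of A against it.
uint64_t count_common_blocked(const std::vector<uint64_t>& a,
                              const std::vector<uint64_t>& b) {
  const size_t block = 1024;  // assumed number of items that fit in cache
  uint64_t count = 0;
  for (size_t start = 0; start < b.size(); start += block) {
    const size_t end = std::min(b.size(), start + block);
    for (uint64_t x : a) {
      for (size_t j = start; j < end; ++j) {
        if (x == b[j]) { ++count; break; }
      }
    }
  }
  return count;
}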
Upvotes: 0
Reputation: 4191
Let's say the first array is arr1 and the second is arr2:
sort arr1                               // max O(n log n)
for (int i = 0; i < arr2.length; i++) { // n
    binarySearch(arr1, arr2[i])         // log n
}
In total: O(n log n).
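In C++, a minimal sketch of this (names are mine; arr1 is sorted in place) could use the standard library directly:

#include <algorithm>
#include <cstdint>
#include <vector>

// Sort one array, then binary-search it for each element of the other.
uint64_t count_duplicates(std::vector<uint64_t>& arr1, const std::vector<uint64_t>& arr2) {
  std::sort(arr1.begin(), arr1.end());                       // O(n log n)
  uint64_t count = 0;
  for (uint64_t v : arr2) {                                  // n lookups, each O(log n)
    if (std::binary_search(arr1.begin(), arr1.end(), v)) ++count;
  }
  return count;
}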
Upvotes: 2
Reputation: 604
64 bits = 8 bytes. 2 * 8 * 100,000,000 bytes = 1.6 GB => you can hold your numbers in RAM (you will need maybe twice as much for the node structures...). I would go for a balanced binary tree (just search the wiki for AVL, AB... trees). Add the numbers from one set to one tree and the numbers from the other set to another tree, then do an intersection.
You might want to just sort the two arrays and intersect them. That should be the easiest solution.
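A minimal sketch of that sort-and-intersect idea (assuming both sets are in std::vector<uint64_t>; the helper name is mine):

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Sort both arrays in place, then collect the values common to both.
std::vector<uint64_t> intersect(std::vector<uint64_t>& a, std::vector<uint64_t>& b) {
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  std::vector<uint64_t> common;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(common));
  return common;
}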
In case you cannot hold all the numbers in memory, use a database (MySQL, PostgreSQL...): two sorted tables and an intersection. It should be quite fast and, most importantly, easy to implement.
Upvotes: 0