JohnDel
JohnDel

Reputation: 2102

Ruby - Optimize the comparison of two arrays with duplicates

I have the following arrays:

A = "cheddar".split(//)  # ["c", "h", "e", "d", "d", "a", "r"]
B = "cheddaar".split(//) # ["c", "h", "e", "d", "d", "a", "a", "r"]

The A array is a subset of the B array. If the A array had another "d" element, it wouldn't be a subset.

I want to compare and find if one is a subset of the other even if they have duplicates. The A - B or the A & B doesn't capture the duplicates it just compares them and find them matched. So I wrote the following which it captures the duplicates:

B.each do |letter|
    A.delete_at(A.index(letter)) rescue ""
end

p A.empty?

Is this the best way or can it be optimized?

Upvotes: 1

Views: 641

Answers (7)

pguardiario
pguardiario

Reputation: 54984

A similar question was posted a few weeks ago and I got the accepted answer with something like:

def is_subset?(a,b)
  !a.find{|x| a.count(x) > b.count(x)}
end

Benchmark Update

require 'benchmark'

def random_char
  ('a'..'z').to_a.sample
end

A = 8.times.map{random_char}
B = 8.times.map{random_char}

def ary_subset?(a,b) # true iff a is subset of b
  a_counts = a.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  b_counts = b.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  a_counts.all? { |a_key,a_ct| a_ct <= b_counts[a_key] }
end

Benchmark.bm do |x|
  x.report('me')     {100000.times{is_subset?(A,B)}}
  x.report('dbenhur'){100000.times{ary_subset?(A,B)}}
end

       user     system      total        real
me  0.375000   0.000000   0.375000 (  0.384022)
dbenhur  2.558000   0.000000   2.558000 (  2.550146)

Upvotes: 1

dbenhur
dbenhur

Reputation: 20408

No idea if this is actually faster than your approach, but its runtime should be O(N+M) where N,M is size of a,b. (Assuming hash lookup and insert is ammortized O(1) which isn't strictly true as hash is usually a function of key size; though in the example, all keys are single characters.) Your looping #delete_at of #index approach has substantial extra data motion and looks like it may be worst case O(N^2 * M).

def ary_subset?(a,b) # true iff a is subset of b
  a_counts = a.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  b_counts = b.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  a_counts.all? { |a_key,a_ct| a_ct <= b_counts[a_key] }
end

The OP asked for the fastest way, so I whipped up a little micro-benchmark available at this gist.

I tested the OP's original approach (op_del), my version of using reduce counts (ct), and the variant where the count array is reused (ct_acc), and the MultiSet approach (mset), EDIT and added the very concise find of count comparisons (slow_ct) . Ran each variant against the OP's small array input example (s), larger sets of cardinality 10,000 (b), and small set against big set (sb). (Had to reduce iteration count for the big set cases by an order of magnitude to get _slow_ct_ to complete in reasonable time.) Results here:

                     user     system      total        real
s_op_del         1.850000   0.000000   1.850000 (  1.853931)
s_ct             2.260000   0.000000   2.260000 (  2.264028)
s_ct_acc         1.700000   0.000000   1.700000 (  1.706881)
s_mset           5.460000   0.000000   5.460000 (  5.484833)
s_slow_ct        1.720000   0.000000   1.720000 (  1.731367)
b_op_del         0.310000   0.000000   0.310000 (  0.312804)
b_ct             0.120000   0.000000   0.120000 (  0.123329)
b_ct_acc         0.100000   0.000000   0.100000 (  0.101532)
b_mset           0.310000   0.000000   0.310000 (  0.319697)
b_slow_ct       82.910000   0.000000  82.910000 ( 83.013747)
sb_op_del        0.710000   0.020000   0.730000 (  0.734022)
sb_ct            0.050000   0.000000   0.050000 (  0.054416)
sb_ct_acc        0.040000   0.000000   0.040000 (  0.059032)
sb_mset          0.110000   0.000000   0.110000 (  0.117027)
sb_slow_ct       0.010000   0.000000   0.010000 (  0.011287)

The reduce count, reusing the count accumulator is the clear winner. Multiset was disappointingly slow.

Upvotes: 3

danieltahara
danieltahara

Reputation: 4993

Definitely want to take advantage of enumerators here -- the best way to go about this is to probably use group_by and compare the number of times each letter appears:

def subset?(a, b)
   a = a.each_char.group_by { |char| char }
   b = b.each_char.group_by { |char| char }
   a.each_key.all? do |letter|
     b[letter] && a[letter].size < b[letter].size
   end
end

So if we count hash lookups as O(1) operations, then this is an O(m + n) algorithm

Upvotes: 2

Harish Shetty
Harish Shetty

Reputation: 64363

Try this:

class String
  def subset_of?(str)
    e2 = str.each_char
    c2 = c2p = nil
    each_char do |c1|
      c2p, c2 = c2, e2.next
      next if c2 == c1    
      c2p, c2 = c2, e2.next until (c2 != c2p) # move until we exclude duplicates
      return false if c2 != c1
    end
    true
  rescue StopIteration
    false
  end
end

Test the function:

>> "chedddar".subset_of?("cheddaaaaaar")
=> false
>> "cheddar".subset_of?("cheddaaaaaar")
=> true
>> "cheddar".subset_of?("cheddaaaaaarkkkk")
=> true
>> "chedddar".subset_of?("cheddar")
=> false
>> "chedddar".subset_of?("chedd")
=> false

Edit 1

Updated the solution based on the additional information provided.

class String
  def subset_of?(str)
    h1, h2 = [self, str].map {|s| s.each_char.reduce(Hash.new(0)){|h, c| h[c] += 1; h}}
    h1.all?{|c, k| h2[c] >= k}
  end
end

Upvotes: 1

Tanzeeb Khalili
Tanzeeb Khalili

Reputation: 7344

How about removing the dupes first?

(A.uniq - B.uniq).empty?

Upvotes: -1

davidrac
davidrac

Reputation: 10738

If I remember correctly, your solution is O(n^2) This one is a bit cumbersome, but more efficient, at least for large inputs (this is O(n)). It might need some more work...

def is_subset?(a, b)
    letters = Hash.new(0)
    a.each_char{|x| letters[x] += 1}
    b.each_char{|x| letters[x] -= 1}
    letters.values.all?{|v| v >= 0 }
end

Edit: a bit more efficient:

def is_subset?(a, b)
    letters = Hash.new(0)
    a.each_char{|x| letters[x] += 1}
    b.each_char.all?{|x| (letters[x] -= 1) > 0}
end

Upvotes: 2

Mark Thomas
Mark Thomas

Reputation: 37517

If I understand the requirements correctly, you can use the multiset gem.

require 'multiset'
a = Multiset.new "cheddar".split(//)
b = Multiset.new "cheddaar".split(//)

a.subset? b #=> true

Upvotes: 2

Related Questions