Reputation: 2102
I have the following arrays:
A = "cheddar".split(//) # ["c", "h", "e", "d", "d", "a", "r"]
B = "cheddaar".split(//) # ["c", "h", "e", "d", "d", "a", "a", "r"]
Array A is a subset of array B. If A had another "d" element, it wouldn't be a subset.
I want to compare the two and find out whether one is a subset of the other even when they contain duplicates. Neither A - B nor A & B captures the duplicates; they only compare membership, so the arrays appear to match. So I wrote the following, which does account for duplicates:
B.each do |letter|
  A.delete_at(A.index(letter)) rescue ""
end
p A.empty?
Is this the best way or can it be optimized?
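For context, a quick check (my own throwaway example) showing that both A - B and A & B report a match even when A has an extra duplicate:

```ruby
a = "cheddar".split(//)
b = "cheddaar".split(//)
extra_d = "chedddar".split(//) # one "d" too many to be a subset of b

# Array#- and Array#& treat the operands as sets, so multiplicity is lost:
(a - b).empty?           # => true
(extra_d - b).empty?     # => true, even though extra_d is NOT a multiset subset of b
(extra_d & b) == (a & b) # => true: the intersections look identical
```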
Upvotes: 1
Views: 641
Reputation: 54984
A similar question was posted a few weeks ago and I got the accepted answer with something like:
def is_subset?(a, b)
  !a.find { |x| a.count(x) > b.count(x) }
end
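Applied to the arrays in the question (definition repeated so the snippet runs on its own):

```ruby
def is_subset?(a, b)
  # nil from find means no element of a occurs more often in a than in b
  !a.find { |x| a.count(x) > b.count(x) }
end

is_subset?("cheddar".split(//), "cheddaar".split(//))  # => true
is_subset?("chedddar".split(//), "cheddaar".split(//)) # => false (three "d"s vs two)
```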
Benchmark Update
require 'benchmark'

def random_char
  ('a'..'z').to_a.sample
end

A = 8.times.map { random_char }
B = 8.times.map { random_char }

def ary_subset?(a, b) # true iff a is subset of b
  a_counts = a.reduce(Hash.new(0)) { |m, v| m[v] += 1; m }
  b_counts = b.reduce(Hash.new(0)) { |m, v| m[v] += 1; m }
  a_counts.all? { |a_key, a_ct| a_ct <= b_counts[a_key] }
end

Benchmark.bm do |x|
  x.report('me')      { 100000.times { is_subset?(A, B) } }
  x.report('dbenhur') { 100000.times { ary_subset?(A, B) } }
end
user system total real
me 0.375000 0.000000 0.375000 ( 0.384022)
dbenhur 2.558000 0.000000 2.558000 ( 2.550146)
Upvotes: 1
Reputation: 20408
No idea if this is actually faster than your approach, but its runtime should be O(N+M), where N and M are the sizes of a and b. (This assumes hash lookup and insert are amortized O(1), which isn't strictly true since hashing is usually a function of key size; in the example, though, all keys are single characters.) Your looping #delete_at of #index approach has substantial extra data motion and looks like it may be worst case O(N^2 * M).
def ary_subset?(a,b) # true iff a is subset of b
  a_counts = a.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  b_counts = b.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  a_counts.all? { |a_key,a_ct| a_ct <= b_counts[a_key] }
end
The OP asked for the fastest way, so I whipped up a little micro-benchmark available at this gist.
I tested the OP's original approach (op_del), my reduce-counts version (ct), the variant where the count accumulator is reused (ct_acc), the Multiset approach (mset), and (EDIT) the very concise find of count comparisons (slow_ct). I ran each variant against the OP's small example arrays (s), larger sets of cardinality 10,000 (b), and the small set against the big set (sb). (I had to reduce the iteration count for the big-set cases by an order of magnitude to get slow_ct to complete in reasonable time.) Results here:
user system total real
s_op_del 1.850000 0.000000 1.850000 ( 1.853931)
s_ct 2.260000 0.000000 2.260000 ( 2.264028)
s_ct_acc 1.700000 0.000000 1.700000 ( 1.706881)
s_mset 5.460000 0.000000 5.460000 ( 5.484833)
s_slow_ct 1.720000 0.000000 1.720000 ( 1.731367)
b_op_del 0.310000 0.000000 0.310000 ( 0.312804)
b_ct 0.120000 0.000000 0.120000 ( 0.123329)
b_ct_acc 0.100000 0.000000 0.100000 ( 0.101532)
b_mset 0.310000 0.000000 0.310000 ( 0.319697)
b_slow_ct 82.910000 0.000000 82.910000 ( 83.013747)
sb_op_del 0.710000 0.020000 0.730000 ( 0.734022)
sb_ct 0.050000 0.000000 0.050000 ( 0.054416)
sb_ct_acc 0.040000 0.000000 0.040000 ( 0.059032)
sb_mset 0.110000 0.000000 0.110000 ( 0.117027)
sb_slow_ct 0.010000 0.000000 0.010000 ( 0.011287)
The reduce-counts variant that reuses the count accumulator is the clear winner. Multiset was disappointingly slow.
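The gist may not survive, so here is a minimal sketch of one plausible reading of the accumulator-reusing variant (ct_acc) — my reconstruction, not the benchmarked code:

```ruby
def ary_subset_acc?(a, b) # true iff a is a multiset subset of b
  counts = Hash.new(0)
  b.each { |v| counts[v] += 1 }        # count b once into a single hash...
  a.all? { |v| (counts[v] -= 1) >= 0 } # ...then spend those counts walking a
end

ary_subset_acc?("cheddar".split(//), "cheddaar".split(//)) # => true
```

Using one hash instead of two halves the allocations and lets the check short-circuit on the first element of a that b cannot cover.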
Upvotes: 3
Reputation: 4993
Definitely want to take advantage of enumerators here -- the best way to go about this is probably to use group_by and compare the number of times each letter appears:
def subset?(a, b)
  a = a.each_char.group_by { |char| char }
  b = b.each_char.group_by { |char| char }
  a.each_key.all? do |letter|
    b[letter] && a[letter].size <= b[letter].size
  end
end
So if we count hash lookups as O(1) operations, this is an O(m + n) algorithm.
Upvotes: 2
Reputation: 64363
Try this:
class String
  def subset_of?(str)
    e2 = str.each_char
    c2 = c2p = nil
    each_char do |c1|
      c2p, c2 = c2, e2.next
      next if c2 == c1
      c2p, c2 = c2, e2.next until (c2 != c2p) # move until we exclude duplicates
      return false if c2 != c1
    end
    true
  rescue StopIteration
    false
  end
end
Test the function:
>> "chedddar".subset_of?("cheddaaaaaar")
=> false
>> "cheddar".subset_of?("cheddaaaaaar")
=> true
>> "cheddar".subset_of?("cheddaaaaaarkkkk")
=> true
>> "chedddar".subset_of?("cheddar")
=> false
>> "chedddar".subset_of?("chedd")
=> false
Edit 1
Updated the solution based on the additional information provided.
class String
  def subset_of?(str)
    h1, h2 = [self, str].map { |s| s.each_char.reduce(Hash.new(0)) { |h, c| h[c] += 1; h } }
    h1.all? { |c, k| h2[c] >= k }
  end
end
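A quick sanity check of the Edit 1 version (definition repeated so it runs standalone):

```ruby
class String
  def subset_of?(str)
    # Count each character of both strings, then require the other string
    # to cover every count of self
    h1, h2 = [self, str].map { |s| s.each_char.reduce(Hash.new(0)) { |h, c| h[c] += 1; h } }
    h1.all? { |c, k| h2[c] >= k }
  end
end

"cheddar".subset_of?("cheddaar")  # => true
"chedddar".subset_of?("cheddaar") # => false
```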
Upvotes: 1
Reputation: 7344
How about removing the dupes first?
(A.uniq - B.uniq).empty?
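One thing to watch (my own check): uniq discards multiplicity, so this reports a subset even when A has extra duplicates:

```ruby
A = "chedddar".split(//) # three "d"s -- not a multiset subset of B
B = "cheddaar".split(//)
(A.uniq - B.uniq).empty? # => true: the duplicate counts were thrown away
```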
Upvotes: -1
Reputation: 10738
If I remember correctly, your solution is O(n^2). This one is a bit cumbersome, but more efficient, at least for large inputs (this is O(n)). It might need some more work...
def is_subset?(a, b) # true iff a is subset of b
  letters = Hash.new(0)
  b.each_char { |x| letters[x] += 1 }
  a.each_char { |x| letters[x] -= 1 }
  letters.values.all? { |v| v >= 0 }
end
Edit: a bit more efficient:
def is_subset?(a, b)
  letters = Hash.new(0)
  b.each_char { |x| letters[x] += 1 }
  a.each_char.all? { |x| (letters[x] -= 1) >= 0 }
end
Upvotes: 2
Reputation: 37517
If I understand the requirements correctly, you can use the multiset gem.
require 'multiset'
a = Multiset.new "cheddar".split(//)
b = Multiset.new "cheddaar".split(//)
a.subset? b #=> true
Upvotes: 2