Reputation: 702
I am going through my system dictionary and looking for words that are, according to a strict definition, neither subsets nor supersets of any other word.
The implementation below does not work, but if it did, it would be pretty efficient, I think. How do I iterate through the array and also remove items from that same array during iteration?
def collect_dead_words
result = @file #the words in my system dictionary, as an array
wg = WordGame.new # the class that "knows" the find_subset_words &
# find_superset_words methods
result.each do |value|
wg.word = value
supersets = wg.find_superset_words.values.flatten
subsets = wg.find_subset_words.values.flatten
result.delete(value) unless (matches.empty? && subsets.empty?)
result.reject! { |cand| supersets.include? cand }
result.reject! { |cand| subsets.include? cand }
end
result
end
Note: find_superset_words
and find_subset_words
both return hashes, hence the values.flatten
bit
Upvotes: 1
Views: 106
Reputation: 110755
The problem
As I understand, string s1
is a subset of string s2
if s1 == s2
after zero or more characters are removed from s2
; that is, if there exists a mapping m
of the indices of s1
such that1:
i
of s1
, s1[i] = s2[m(i)]
; andi < j
then m(i) < m(j)
.Further s2
is a superset of s1
if and only if s1
is a subset of s2
.
Note that for s1
to be a subset of s2
, s1.size <= s2.size
must be true.
For example:
m(0) = 3
and m(1) = 2
, but m(0) < m(1)
is false;Algorithm
Subset (and hence superset) is a transitive relation, which permit significant algorithmic efficiencies. By this I mean that if s1
is a subset of s2
and s2
is a subset of s3
, then s1
is a subset of s3
.
I will proceed as follows:
neither_sub_nor_sup
and longest_sups
and an empty array subs_and_sups
.w
to neither_sub_nor_sup
, where w
is longest word in the dictionary.w
in the dictionary (longest to shortest), perform the following operations:
u
of neither_sub_nor_sup
determine if w
is a subset of u
. If it is, move u
from neither_sub_nor_sup
to longest_sups
and append u
to subs_and_sups
.neither_sub_nor_sup
to longest_sups
, append w
to subs_and_sups
; else add w
to neither_sub_nor_sup
.subs_and_sups
.Code
require 'set'
def identify_subs_and_sups(dict)
neither_sub_nor_sup, longest_sups = Set.new, Set.new
dict.sort_by(&:size).reverse.each_with_object([]) do |w,subs_and_sups|
switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
arr << u if w.subset(u) }
if switchers.any?
subs_and_sups << w
switchers.each do |u|
neither_sub_nor_sup.delete(u)
longest_sups << u
subs_and_sups << u
end
else
neither_sub_nor_sup << w
end
end
end
class String
def subset(w)
w =~ Regexp.new(self.gsub(/./) { |m| "#{m}\\w*" })
end
end
Example
dict = %w| cat catch craft cutie enact trivial rivert river |
#=> ["cat", "catch", "craft", "cutie", "enact", "trivial", "rivert", "river"]
identify_subs_and_sups(dict)
#=> ["river", "rivert", "cat", "catch", "craft"]
Variant
Rather than processing the words in the dictionary from longest to shortest, we could instead order them shortest to longest:
def identify_subs_and_sups1(dict)
neither_sub_nor_sup, shortest_sups = Set.new, Set.new
dict.sort_by(&:size).each_with_object([]) do |w,subs_and_sups|
switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
arr << u if u.subset(w) }
if switchers.any?
subs_and_sups << w
switchers.each do |u|
neither_sub_nor_sup.delete(u)
shortest_sups << u
subs_and_sups << u
end
else
neither_sub_nor_sup << w
end
end
end
identify_subs_and_sups1(dict)
#=> ["craft", "cat", "rivert", "river"]
Benchmarks
(to be continued...)
1 The OP stated (in a later comment) that s1
is not a substring of s2
if s2.include?(s1) #=> true
. I am going to pretend I never saw that, as it throws a spanner into the works. Unfortunately, subset
is no longer a transitive relation with that additional requirement. I haven't investigate the implications of that, but I suspect it means a rather brutish algorithm would be required, possibly requiring pairwise comparisons of all the words in the dictionary.
Upvotes: 1
Reputation: 702
Here's what I came up with based on your feedback (plus a further optimization by starting with the shortest words):
def collect_dead_words
result = []
collection = @file
num = @file.max_by(&:length).length
1.upto(num) do |index|
subset_by_length = collection.select {|word| word.length == index }
while !subset_by_length.empty? do
wg = WordGame.new(subset_by_length[0])
supermatches = wg.find_superset_words.values.flatten
submatches = wg.find_subset_words.values.flatten
collection.reject! { |cand| supermatches.include? cand }
collection.reject! { |cand| submatches.include? cand }
result << wg.word if (supermatches.empty? && submatches.empty?)
subset.delete(subset_by_length[0])
collection.delete(subset_by_length[0])
end
end
result
end
Further optimizations are welcome!
Upvotes: 1
Reputation: 6076
One way to accomplish this is with Array#delete_if. Here's my run at it so you get the idea:
supersets_and_subsets = []
result.delete_if do |el|
wg.word = el
superset_and_subset = wg.find_superset_words.values.flatten + wg.find_subset_words.values.flatten
supersets_and_subsets << superset_and_subset
!superset_and_subset.empty?
end
result -= supersets_and_subsets.flatten.uniq
Upvotes: 1
Reputation: 198496
It is inadvisable to modify a collection while iterating over it. Instead, either iterate over a copy of the collection, or create a separate array of things to remove later.
Upvotes: 5