pgblu
pgblu

Reputation: 702

Dynamically deleting elements from an array while enumerating through it

I am going through my system dictionary and looking for words that are, according to a strict definition, neither subsets nor supersets of any other word.

The implementation below does not work, but if it did, it would be pretty efficient, I think. How do I iterate through the array and also remove items from that same array during iteration?

def collect_dead_words
  result = @file #the words in my system dictionary, as an array
  wg = WordGame.new # the class that "knows" the find_subset_words & 
                    # find_superset_words methods
  result.each do |value|
    wg.word = value
    supersets = wg.find_superset_words.values.flatten
    subsets = wg.find_subset_words.values.flatten
    result.delete(value) unless (matches.empty? && subsets.empty?)
    result.reject! { |cand| supersets.include? cand }
    result.reject! { |cand| subsets.include? cand }
  end
  result
end

Note: find_superset_words and find_subset_words both return hashes, hence the values.flatten bit

Upvotes: 1

Views: 106

Answers (4)

Cary Swoveland
Cary Swoveland

Reputation: 110755

The problem

As I understand, string s1 is a subset of string s2 if s1 == s2 after zero or more characters are removed from s2; that is, if there exists a mapping m of the indices of s1 such that1:

  • for each index i of s1, s1[i] = s2[m(i)]; and
  • if i < j then m(i) < m(j).

Further s2 is a superset of s1 if and only if s1 is a subset of s2.

Note that for s1 to be a subset of s2, s1.size <= s2.size must be true.

For example:

  • "cat" is a subset of "craft" because the latter becomes "cat" if the "r" and "f" are removed.
  • "cat" is not a subset of "cutie" because "cutie" has no "a".
  • "cat" is not a superset of "at" because "cat".include?("at") #=> true`.
  • "cat" is not a subset of "enact" because m(0) = 3 and m(1) = 2, but m(0) < m(1) is false;

Algorithm

Subset (and hence superset) is a transitive relation, which permit significant algorithmic efficiencies. By this I mean that if s1 is a subset of s2 and s2 is a subset of s3, then s1 is a subset of s3.

I will proceed as follows:

  • Create empty sets neither_sub_nor_sup and longest_sups and an empty array subs_and_sups.
  • Sort the words in the dictionary by length, longest first.
  • Add w to neither_sub_nor_sup, where w is longest word in the dictionary.
  • For each subsequent word w in the dictionary (longest to shortest), perform the following operations:
    • for each element u of neither_sub_nor_sup determine if w is a subset of u. If it is, move u from neither_sub_nor_sup to longest_sups and append u to subs_and_sups.
    • if one or more elements were moved from from neither_sub_nor_sup to longest_sups, append w to subs_and_sups; else add w to neither_sub_nor_sup.
  • Return subs_and_sups.

Code

require 'set'

def identify_subs_and_sups(dict)
  neither_sub_nor_sup, longest_sups = Set.new, Set.new
  dict.sort_by(&:size).reverse.each_with_object([]) do |w,subs_and_sups|
    switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
      arr << u if w.subset(u) }
    if switchers.any?
      subs_and_sups << w
      switchers.each do |u|
        neither_sub_nor_sup.delete(u)
        longest_sups << u
        subs_and_sups << u
      end
    else
      neither_sub_nor_sup << w
    end
  end
end

class String
  def subset(w)
    w =~ Regexp.new(self.gsub(/./) { |m| "#{m}\\w*" })
  end
end

Example

dict = %w| cat catch craft cutie enact trivial rivert river |  
  #=> ["cat", "catch", "craft", "cutie", "enact", "trivial", "rivert", "river"] 
identify_subs_and_sups(dict)
  #=> ["river", "rivert", "cat", "catch", "craft"]

Variant

Rather than processing the words in the dictionary from longest to shortest, we could instead order them shortest to longest:

def identify_subs_and_sups1(dict)
  neither_sub_nor_sup, shortest_sups = Set.new, Set.new
  dict.sort_by(&:size).each_with_object([]) do |w,subs_and_sups|
    switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
      arr << u if u.subset(w) }
    if switchers.any?
      subs_and_sups << w
      switchers.each do |u|
        neither_sub_nor_sup.delete(u)
        shortest_sups << u
        subs_and_sups << u
      end
    else
      neither_sub_nor_sup << w
    end
  end
end

identify_subs_and_sups1(dict)
  #=> ["craft", "cat", "rivert", "river"]

Benchmarks

(to be continued...)

1 The OP stated (in a later comment) that s1 is not a substring of s2 if s2.include?(s1) #=> true. I am going to pretend I never saw that, as it throws a spanner into the works. Unfortunately, subset is no longer a transitive relation with that additional requirement. I haven't investigate the implications of that, but I suspect it means a rather brutish algorithm would be required, possibly requiring pairwise comparisons of all the words in the dictionary.

Upvotes: 1

pgblu
pgblu

Reputation: 702

Here's what I came up with based on your feedback (plus a further optimization by starting with the shortest words):

def collect_dead_words
  result = []
  collection = @file
  num = @file.max_by(&:length).length
  1.upto(num) do |index|
    subset_by_length = collection.select {|word| word.length == index }
    while !subset_by_length.empty? do
      wg = WordGame.new(subset_by_length[0])
      supermatches = wg.find_superset_words.values.flatten
      submatches = wg.find_subset_words.values.flatten
      collection.reject! { |cand| supermatches.include? cand }
      collection.reject! { |cand| submatches.include? cand }
      result << wg.word if (supermatches.empty? && submatches.empty?)
      subset.delete(subset_by_length[0])
      collection.delete(subset_by_length[0])
    end
  end
  result
end

Further optimizations are welcome!

Upvotes: 1

seph
seph

Reputation: 6076

One way to accomplish this is with Array#delete_if. Here's my run at it so you get the idea:

supersets_and_subsets = []

result.delete_if do |el|
  wg.word = el
  superset_and_subset = wg.find_superset_words.values.flatten + wg.find_subset_words.values.flatten
  supersets_and_subsets << superset_and_subset

  !superset_and_subset.empty?
end

result -= supersets_and_subsets.flatten.uniq

Upvotes: 1

Amadan
Amadan

Reputation: 198496

It is inadvisable to modify a collection while iterating over it. Instead, either iterate over a copy of the collection, or create a separate array of things to remove later.

Upvotes: 5

Related Questions