peter

Reputation: 42207

Ruby: automate multiple regular expression replacements

I want to do multiple regular expression replacements on an array. I have this working code, but it doesn't seem like the Ruby way. Does anyone have a better solution?

# files contains the strings that need cleaning
files = [
  "Beatles - The Word ",
  "The Beatles - The Word",
  "Beatles - Tell Me Why",
  "Beatles - Tell Me Why (remastered)",
  "Beatles - Love me do"
]

# ignore contains the regular expressions that need to be checked
ignore = [/the/,/\(.*\)/,/remastered/,/live/,/remix/,/mix/,/acoustic/,/version/,/  +/]

files.each do |file|
  ignore.each do |e|
    file.downcase!
    file.gsub!(e," ")
    file.strip!
  end
end
p files
#=>["beatles - word", "beatles - word", "beatles - tell me why", "beatles - tell me why", "beatles - love me do"]

Upvotes: 1

Answers (3)

peter

Reputation: 42207

I built this solution from your answers, in two versions: one that converts the array to a string (which doesn't change the files array) and one that extends Array (which does change the files array itself). In the benchmark below, the class approach is roughly 1.75 times as fast (0.187 s versus 0.328 s). If anyone still has suggestions, please share them.

files = [
  "Beatles - The Word ",
  "The Beatles - The Word",
  "Beatles - Tell Me Why",
  "The Beatles - Tell Me Why (remastered)",
  "Beatles - wordwiththein wordwithlivein"
]

ignore = /\(.*\)|[_]|\b(the|remastered|live|remix|mix|acoustic|version)\b/

class Array
  def cleanup ignore
    self.each do |e|
      e.downcase!
      e.gsub!(ignore," ")
      e.gsub!(/  +/," ")
      e.strip!
    end
  end
end

# downcase (not downcase!) is used here: downcase! returns nil when nothing changes
p files.join("#").downcase.gsub(ignore," ").gsub(/  +/," ").split(/ *# */)
#=>["beatles - word", "beatles - word", "beatles - tell me why", "beatles - tell me why", "beatles - wordwiththein wordwithlivein"]

require 'benchmark'

Benchmark.bm do |x|
  x.report("string method")  { 10000.times { files.join("#").downcase.gsub(ignore," ").gsub(/  +/," ").split(/ *# */) } }
  x.report("class  method")   { 10000.times { files.cleanup ignore } }
end

=begin
       user     system      total        real
string method  0.328000   0.000000   0.328000 (  0.327600)
class  method  0.187000   0.000000   0.187000 (  0.187200)
=end
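
One thing to keep in mind when reading these timings: cleanup modifies the strings in place, so after the first pass the class method is working on input that is already clean. Below is a sketch of a comparison in which both variants start from the same untouched array on every run. The plain methods cleanup and cleanup_joined are illustrative names, not part of the answer above, and no timings are claimed for them.

```ruby
require 'benchmark'

files = [
  "Beatles - The Word ",
  "The Beatles - The Word",
  "Beatles - Tell Me Why",
  "The Beatles - Tell Me Why (remastered)",
  "Beatles - wordwiththein wordwithlivein"
]

ignore = /\(.*\)|[_]|\b(the|remastered|live|remix|mix|acoustic|version)\b/

# Hypothetical non-mutating variant: returns a new array, leaves the input alone,
# and avoids reopening the core Array class.
def cleanup(list, ignore)
  list.map { |e| e.downcase.gsub(ignore, " ").gsub(/  +/, " ").strip }
end

# Hypothetical wrapper around the join/split variant, for a like-for-like comparison.
def cleanup_joined(list, ignore)
  list.join("#").downcase.gsub(ignore, " ").gsub(/  +/, " ").split(/ *# */)
end

Benchmark.bm(15) do |x|
  x.report("joined string") { 10_000.times { cleanup_joined(files, ignore) } }
  x.report("per element")   { 10_000.times { cleanup(files, ignore) } }
end
```

Because neither method mutates files, the order of the two reports no longer affects what each one measures.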

Upvotes: 0

steenslag

Reputation: 80075

ignore = ["the", "(", ".",  "*", ")", "remastered", "live", "remix",  "mix", "acoustic", "version", "+"]
re = Regexp.union(ignore)
p re #=> /the|\(|\.|\*|\)|remastered|live|remix|mix|acoustic|version|\+/

Regexp.union takes care of escaping.
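
If the goal is to remove whole words and parenthesised parts rather than the literal characters, Regexp.union can still build the word alternation, which can then be wrapped in \b anchors. A minimal sketch, assuming a plain word list; the names words and re are only illustrative:

```ruby
# Illustrative word list; Regexp.union escapes each entry as needed.
words = %w[the remastered live remix mix acoustic version]

# Interpolating a Regexp into another regex embeds it as a group,
# so the \b anchors apply to the whole alternation.
re = /\b#{Regexp.union(words)}\b|\([^()]*\)/

p "there's a place (live)".gsub(re, " ").strip
#=> "there's a place"
```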

Upvotes: 3

Tim Pietzcker

Reputation: 336368

You can put most of these into a single regex replace operation. Also, you should use word boundary anchors (\b); otherwise, for example, the will also match inside "There's a Place".

file.gsub!(/(?:\b(?:the|remastered|live|remix|mix|acoustic|version)\b)|\([^()]*\)/, ' ')

should take care of this.

Then, you can strip multiple spaces in a second step:

file.gsub!(/  +/, ' ')
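
Put together, the two steps could look like the sketch below when applied to a whole array. It uses map and the non-bang methods so the original strings stay untouched; that choice, the constant name IGNORE, and the sample titles are assumptions made for illustration.

```ruby
# Single combined pattern: whole ignored words, or a parenthesised part.
IGNORE = /\b(?:the|remastered|live|remix|mix|acoustic|version)\b|\([^()]*\)/

files = [
  "Beatles - The Word ",
  "Beatles - Tell Me Why (remastered)",
  "Beatles - There's a Place (live mix)"
]

cleaned = files.map do |file|
  file.downcase
      .gsub(IGNORE, " ")   # step 1: remove ignored words and parentheses
      .gsub(/  +/, " ")    # step 2: collapse runs of spaces
      .strip
end

p cleaned
#=> ["beatles - word", "beatles - tell me why", "beatles - there's a place"]
```

Note that "There's" survives intact because \bthe\b only matches the standalone word.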

If you want to keep the regexes in an array, then you do need to iterate through the array and do the replacements for each regex. But you can at least take some commands out of the loop:

files.each do |file|
  file.downcase!
  ignore.each do |e|
    file.gsub!(e," ")
  end
  file.strip!
end

Of course, then you will need to put word boundaries around each word in your ignore list:

ignore = [/\bthe\b/, /\([^()]*\)/, /\bremastered\b/, ...]
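
Rather than typing the anchors by hand, the array could also be generated from a plain word list. A sketch under that assumption; words is a hypothetical name, and the parenthesis pattern is appended separately because it is not a word:

```ruby
# Hypothetical word list; each entry becomes its own anchored regex.
words  = %w[the remastered live remix mix acoustic version]
ignore = words.map { |w| /\b#{Regexp.escape(w)}\b/ } << /\([^()]*\)/

files = ["The Beatles - The Word", "Beatles - Tell Me Why (remastered)"]

files.each do |file|
  file.downcase!
  ignore.each { |e| file.gsub!(e, " ") }
  file.gsub!(/  +/, " ")   # collapse the gaps left by the replacements
  file.strip!
end

p files
#=> ["beatles - word", "beatles - tell me why"]
```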

Upvotes: 1
