Reputation: 1920
I'm looking to run a search through some files to see if they have a comment block at the top of the file.
Here's what I'm searching for:
#++
# app_name/dir/dir/filename
# $Id$
#--
I had this as a regex and came up short:
:doc => { :test => '^#--\s+[filename]\s+\$Id'
if @file_text =~ Regexp.new(@rules[rule][:test])
....
Any suggestions?
Upvotes: 0
Views: 859
Reputation: 160551
Rather than try to do it all in a single pattern, which will become difficult to maintain as your file headers change or grow, use several small tests that give you granularity. I'd do something like:
lines = '#++
# app_name/dir/dir/filename
# $Id$
#--
'
Split the text so you can retrieve the lines you want, and normalize them:
l1, l2, l3, l4 = lines.split("\n").map{ |s| s.strip.squeeze(' ') }
This is what they contain now:
[l1, l2, l3, l4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]
Here's a set of tests, one for each line:
!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/]) # => true
Here's what is being tested and what each returns:
l1[/^#\+\+/] # => "#++"
l2[/^#\s[\w\/]+/] # => "# app_name/dir/dir/filename"
l3[/^#\s\$Id\$/i] # => "# $Id$"
l4[/^#--/] # => "#--"
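If it's useful, those four tests can be rolled into a single predicate. A minimal sketch, assuming the four normalized lines are passed in (header_block? is a hypothetical name, not from the code above):

def header_block?(l1, l2, l3, l4)
  # Guard against nil so files shorter than four lines return false.
  return false unless l1 && l2 && l3 && l4
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
end

header_block?(l1, l2, l3, l4) # => true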
There are many different ways to grab the first "n" lines of a file. Here are a few:
File.foreach('test.txt').to_a[0, 4] # => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]
File.readlines('test.txt')[0, 4] # => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]
File.read('test.txt').split("\n")[0, 4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]
The downside of these is that they all "slurp" the input file, which, on a huge file, will cause problems. It's trivial to write a piece of code that opens a file, reads the first four lines, and returns them in an array. This is untested but looks about right:
def get_four_lines(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  ary
end
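One caveat: readline raises EOFError when a file has fewer than four lines. A variant using gets, which returns nil at end of file, sidesteps that. A sketch under the same assumptions, equally untested:

def get_four_lines(path)
  File.open(path, 'r') do |fi|
    # gets returns nil at EOF, so short files yield a shorter array.
    Array.new(4) { fi.gets }.compact
  end
end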
Here's a quick little benchmark to show why I'd go this way:
require 'fruity'

def slurp_file(path)
  File.read(path).split("\n")[0, 4] rescue []
end

def read_first_four_from_file(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  ary
rescue
  []
end
PATH = '/etc/'
# Dir.exist? needs the full path, otherwise it tests against the current directory.
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }
  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end
Running that as root outputs:
Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0
That's reading approximately 105 files in my /etc directory.
Modifying the test to actually parse the lines and return true/false:
require 'fruity'

def slurp_file(path)
  ary = File.read(path).split("\n")[0, 4]
  !!(/#\+\+\n(.|\n)*?##\-\-/.match(ary.join("\n")))
rescue
  false # return a consistent value to fruity
end

def read_first_four_from_file(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
  false # return a consistent value to fruity
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }
  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end
Running that again returns:
Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0
One comment objected, "Your benchmark isn't fair." Here's one that's "fair":
require 'fruity'

def slurp_file(path)
  text = File.read(path)
  !!(/#\+\+\n(.|\n)*?##\-\-/.match(text))
rescue
  false # return a consistent value to fruity
end

def read_first_four_from_file(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
  false # return a consistent value to fruity
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }
  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end
Which outputs:
Running each test once. Test will take about 1 second.
read_four is similar to slurp
Joining the split strings back into a longer string prior to doing the match was the wrong path, so working from the full file's content is a more even test.
Another comment said, "[...] Just read the first four lines and apply the pattern, that's it".
That's not just it: a multiline regex written to find information spanning multiple lines can't be passed single text lines and return accurate results, so it needs a long string to work against. Determining how many characters make up four lines would only add overhead and slow the algorithm; that's what the previous benchmark did, and it wasn't "fair".
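A quick illustration of that point, using the question's single-# header format: the multiline pattern matches the joined lines, but never any single line.

lines = ['#++', '# app_name/dir/dir/filename', '# $Id$', '#--']
pattern = /#\+\+.*?#--/m
lines.any? { |line| line =~ pattern } # => false
lines.join("\n") =~ pattern           # => 0 (match at offset 0)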
A further comment: "Depends on your input data. If you run this code over a complete (bigger) source-code folder, it will slow it down significantly."
There were 105+ files in the directory. That's a reasonably large number of files, but iterating over a large number of files won't show a difference: Ruby's ability to open files isn't the issue, it's the I/O speed of reading a file in one pass versus line by line. And, from experience, I know line-by-line I/O is fast. Again, a benchmark says:
require 'fruity'

LITTLEFILE = 'little.txt'
MEDIUMFILE = 'medium.txt'
BIGFILE = 'big.txt'

LINES = '#++
# app_name/dir/dir/filename
# $Id$
#--
'

LITTLEFILE_MULTIPLIER = 1
MEDIUMFILE_MULTIPLIER = 1_000
BIGFILE_MULTIPLIER = 100_000

def _slurp_file(path)
  File.read(path)
  true # return a consistent value to fruity
end

def _read_first_four_from_file(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  true # return a consistent value to fruity
end

[
  [LITTLEFILE, LITTLEFILE_MULTIPLIER],
  [MEDIUMFILE, MEDIUMFILE_MULTIPLIER],
  [BIGFILE, BIGFILE_MULTIPLIER]
].each do |file, mult|
  File.write(file, LINES * mult)
  puts "Benchmarking against #{ file }"
  puts "%s is %d bytes" % [file, File.size(file)]

  compare do
    slurp { _slurp_file(file) }
    read_first_four_from_file { _read_first_four_from_file(file) }
  end

  puts
end
With the output:
Benchmarking against little.txt
little.txt is 49 bytes
Running each test 128 times. Test will take about 1 second.
slurp is similar to read_first_four_from_file
Benchmarking against medium.txt
medium.txt is 49000 bytes
Running each test 128 times. Test will take about 1 second.
read_first_four_from_file is faster than slurp by 39.99999999999999% ± 10.0%
Benchmarking against big.txt
big.txt is 4900000 bytes
Running each test 128 times. Test will take about 4 seconds.
read_first_four_from_file is faster than slurp by 100x ± 10.0
Reading a small file of four lines, read is as fast as foreach, but once the file size increases the overhead of reading the entire file starts to impact the times.
Any solution relying on slurping files is a bad idea; it's not scalable, and it can actually cause code to halt due to memory allocation if BIG files are encountered. Reading only the first four lines will always run at a consistent speed independent of the file sizes, so use that technique EVERY time there's a chance the file sizes will vary. Or, at least, be very aware of the impact on run times and the potential problems slurping files can cause.
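For what it's worth, one way to get the first-four-lines behavior without slurping or a hand-rolled loop is to take lines lazily from foreach's enumerator. Unlike the to_a version shown earlier, first(4) stops reading after four lines, so it stays fast no matter how big the file is:

File.foreach('test.txt').first(4)
# => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]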
Upvotes: 1
Reputation: 157967
Check this example:
string = <<EOF
#++
## app_name/dir/dir/filename
## $Id$
##--
foo bar
EOF
puts /#\+\+.*\n##.*\n##.*\n##--/.match(string)
The pattern matches two lines starting with ## between a line starting with #++ and a line starting with ##--, and includes those boundary lines in the match. If I got the question right, this should be what you want.
You can generalize the pattern to match everything between the first #++ and the first ##-- (including them) using the following pattern:
puts /#\+\+.*?##--/m.match(string)
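The /m flag lets . match newlines, and the lazy .*? is what stops the match at the first ##-- rather than the last. A tiny demonstration of the difference, on a contrived sample string:

s = "#++ a ##-- b ##--"
s[/#\+\+.*##--/m]  # => "#++ a ##-- b ##--" (greedy: runs to the last ##--)
s[/#\+\+.*?##--/m] # => "#++ a ##--"        (lazy: stops at the first ##--)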
Upvotes: 5