Reputation: 1920
I'm looking to run a search through some files to see if they have a comment block at the top of the file.
Here's what I'm searching for:
#++
# app_name/dir/dir/filename
# $Id$
#--
I had this as a regex and came up short:
:doc => { :test => '^#--\s+[filename]\s+\$Id'
if @file_text =~ Regexp.new(@rules[rule][:test])
....
Any suggestions?
Upvotes: 0
Views: 859
Reputation: 160551
Rather than try to do it all in a single pattern, which will become difficult to maintain as your file headers change or grow, use several small tests that give you granularity. I'd do something like:
lines = '#++
# app_name/dir/dir/filename
# $Id$
#--
'
Split the text so you can retrieve the lines you want, and normalize them:
l1, l2, l3, l4 = lines.split("\n").map{ |s| s.strip.squeeze(' ') }
This is what they contain now:
[l1, l2, l3, l4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]
Here's a set of tests, one for each line:
!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/]) # => true
Here's what is being tested and what each returns:
l1[/^#\+\+/] # => "#++"
l2[/^#\s[\w\/]+/] # => "# app_name/dir/dir/filename"
l3[/^#\s\$Id\$/i] # => "# $Id$"
l4[/^#--/] # => "#--"
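If it's useful, those four tests can be rolled into a single predicate. A minimal sketch, assuming the four normalized lines are passed in (header_block? is a hypothetical name, not from the code above):

def header_block?(l1, l2, l3, l4)
  # Guard against nil so files shorter than four lines return false.
  return false unless l1 && l2 && l3 && l4
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
end

header_block?(l1, l2, l3, l4) # => true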
There are many different ways to grab the first "n" lines of a file. Here are a few:
File.foreach('test.txt').to_a[0, 4] # => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]
File.readlines('test.txt')[0, 4] # => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]
File.read('test.txt').split("\n")[0, 4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]
The downside of these is that they all "slurp" the input file, which, on a huge file, will cause problems. It's trivial to write a piece of code that opens a file, reads the first four lines, and returns them in an array. This is untested but looks about right:
def get_four_lines(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  ary
end
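One caveat: readline raises EOFError when a file has fewer than four lines. A variant using gets, which returns nil at end of file, sidesteps that. A sketch under the same assumptions, equally untested:

def get_four_lines(path)
  File.open(path, 'r') do |fi|
    # gets returns nil at EOF, so short files yield a shorter array.
    Array.new(4) { fi.gets }.compact
  end
end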
Here's a quick little benchmark to show why I'd go this way:
require 'fruity'

def slurp_file(path)
  File.read(path).split("\n")[0, 4] rescue []
end

def read_first_four_from_file(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  ary
rescue
  []
end
PATH = '/etc/'
# Dir.exist? needs the full path, otherwise it tests against the current directory.
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }
  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end
Running that as root outputs:
Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0
That's reading approximately 105 files in my /etc directory.
Modifying the test to actually parse the lines and return true/false:
require 'fruity'

def slurp_file(path)
  ary = File.read(path).split("\n")[0, 4]
  !!(/#\+\+\n(.|\n)*?##\-\-/.match(ary.join("\n")))
rescue
  false # return a consistent value to fruity
end

def read_first_four_from_file(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
  false # return a consistent value to fruity
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }
  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end
Running that again returns:
Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0
One comment objected, "Your benchmark isn't fair." Here's one that's "fair":
require 'fruity'

def slurp_file(path)
  text = File.read(path)
  !!(/#\+\+\n(.|\n)*?##\-\-/.match(text))
rescue
  false # return a consistent value to fruity
end

def read_first_four_from_file(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
  false # return a consistent value to fruity
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(File.join(PATH, f)) }.map{ |f| File.join(PATH, f) }

compare do
  slurp {
    FILES.each do |f|
      slurp_file(f)
    end
  }
  read_four {
    FILES.each do |f|
      read_first_four_from_file(f)
    end
  }
end
Which outputs:
Running each test once. Test will take about 1 second.
read_four is similar to slurp
Joining the split strings back into a longer string prior to doing the match was the wrong path, so working from the full file's content is a more even test.
Another comment said, "[...] Just read the first four lines and apply the pattern, that's it".
That's not just it: a multiline regex written to find information spanning multiple lines can't be passed single text lines and return accurate results, so it needs a long string to work against. Determining how many characters make up four lines would only add overhead and slow the algorithm; that's what the previous benchmark did, and it wasn't "fair".
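A quick illustration of that point, using the question's single-# header format: the multiline pattern matches the joined lines, but never any single line.

lines = ['#++', '# app_name/dir/dir/filename', '# $Id$', '#--']
pattern = /#\+\+.*?#--/m
lines.any? { |line| line =~ pattern } # => false
lines.join("\n") =~ pattern           # => 0 (match at offset 0)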
A further comment: "Depends on your input data. If you run this code over a complete (bigger) source-code folder, it will slow it down significantly."
There were 105+ files in the directory. That's a reasonably large number of files, but iterating over a large number of files won't show a difference: Ruby's ability to open files isn't the issue, it's the I/O speed of reading a file in one pass versus line by line. And, from experience, I know line-by-line I/O is fast. Again, a benchmark says:
require 'fruity'

LITTLEFILE = 'little.txt'
MEDIUMFILE = 'medium.txt'
BIGFILE = 'big.txt'

LINES = '#++
# app_name/dir/dir/filename
# $Id$
#--
'

LITTLEFILE_MULTIPLIER = 1
MEDIUMFILE_MULTIPLIER = 1_000
BIGFILE_MULTIPLIER = 100_000

def _slurp_file(path)
  File.read(path)
  true # return a consistent value to fruity
end

def _read_first_four_from_file(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  l1, l2, l3, l4 = ary
  true # return a consistent value to fruity
end

[
  [LITTLEFILE, LITTLEFILE_MULTIPLIER],
  [MEDIUMFILE, MEDIUMFILE_MULTIPLIER],
  [BIGFILE, BIGFILE_MULTIPLIER]
].each do |file, mult|
  File.write(file, LINES * mult)
  puts "Benchmarking against #{ file }"
  puts "%s is %d bytes" % [file, File.size(file)]

  compare do
    slurp { _slurp_file(file) }
    read_first_four_from_file { _read_first_four_from_file(file) }
  end

  puts
end
With the output:
Benchmarking against little.txt
little.txt is 49 bytes
Running each test 128 times. Test will take about 1 second.
slurp is similar to read_first_four_from_file
Benchmarking against medium.txt
medium.txt is 49000 bytes
Running each test 128 times. Test will take about 1 second.
read_first_four_from_file is faster than slurp by 39.99999999999999% ± 10.0%
Benchmarking against big.txt
big.txt is 4900000 bytes
Running each test 128 times. Test will take about 4 seconds.
read_first_four_from_file is faster than slurp by 100x ± 10.0
Reading a small file of four lines, read is as fast as foreach, but once the file size increases the overhead of reading the entire file starts to impact the times.
Any solution relying on slurping files is a bad idea; it's not scalable, and it can actually cause code to halt due to memory allocation if BIG files are encountered. Reading only the first four lines will always run at a consistent speed independent of the file sizes, so use that technique EVERY time there's a chance the file sizes will vary. Or, at least, be very aware of the impact on run times and the potential problems slurping files can cause.
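For what it's worth, one way to get the first-four-lines behavior without slurping or a hand-rolled loop is to take lines lazily from foreach's enumerator. Unlike the to_a version shown earlier, first(4) stops reading after four lines, so it stays fast no matter how big the file is:

File.foreach('test.txt').first(4)
# => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]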
Upvotes: 1
Reputation: 157967
Check this example:
string = <<EOF
#++
## app_name/dir/dir/filename
## $Id$
##--
foo bar
EOF
puts /#\+\+.*\n##.*\n##.*\n##--/.match(string)
The pattern matches two lines starting with ## between a line starting with #++ and a line starting with ##--, and includes those boundary lines in the match. If I got the question right, this should be what you want.
You can generalize the pattern to match everything between the first #++ and the first ##-- (including them) using the following pattern:
puts /#\+\+.*?##--/m.match(string)
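The /m flag lets . match newlines, and the lazy .*? is what stops the match at the first ##-- rather than the last. A tiny demonstration of the difference, on a contrived sample string:

s = "#++ a ##-- b ##--"
s[/#\+\+.*##--/m]  # => "#++ a ##-- b ##--" (greedy: runs to the last ##--)
s[/#\+\+.*?##--/m] # => "#++ a ##--"        (lazy: stops at the first ##--)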
Upvotes: 5