Reputation: 3352
I just want to analyze my IIS logs (bonus: I happened to learn that IIS logs are encoded in ASCII).
Here's my Ruby code:
1. readlines

    # keywords, extensions, total_count and hit_count are set up earlier (not shown)
    Dir.glob("*.log").each do |filename|
      File.readlines(filename, :encoding => "ASCII").each do |line|
        # skip comment lines
        next if line[0] == '#'

        line_content = line.downcase
        # only the first matching keyword counts
        matched_keyword = keywords.find { |e| line_content.include?(e) }
        total_count += 1 if extensions.any? { |e| line_content.include?(e) }
        hit_count[matched_keyword] += 1 unless matched_keyword.nil?
      end
    end

2. open

    # keywords, extensions, total_count and hit_count are set up earlier (not shown)
    Dir.glob("*.log").each do |filename|
      File.open(filename, :encoding => "ASCII").each_line do |line|
        # skip comment lines
        next if line[0] == '#'

        line_content = line.downcase
        # only the first matching keyword counts
        matched_keyword = keywords.find { |e| line_content.include?(e) }
        total_count += 1 if extensions.any? { |e| line_content.include?(e) }
        hit_count[matched_keyword] += 1 unless matched_keyword.nil?
      end
    end

`readlines` reads the whole file into memory, so I expected it to be faster. Why is the `open` version consistently a bit faster instead? I tested it a couple of times on Windows 7 with Ruby 1.9.3.
Upvotes: 6
Views: 11339
Reputation: 14082
Both `readlines` and `open.each_line` read the file only once. Ruby also buffers IO objects, so it reads a block of data (e.g. 64KB) from disk at a time to minimize the cost of disk reads. There should be little difference between the two in the time spent on the disk-read step.
When you call `readlines`, Ruby constructs an empty array `[]`, then repeatedly reads a line of the file and pushes it onto the array. At the end it returns the array containing all the lines of the file.
When you call `each_line`, Ruby reads a line of the file and yields it to your block. When you have finished processing that line, Ruby reads the next one. It keeps reading lines until there is no more content in the file.
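A minimal sketch of that line-at-a-time behaviour, using `StringIO` from the standard library as a stand-in for a log file. The sample contents and the variable names `seen` and `remaining` are made up for illustration:

```ruby
require 'stringio'

# In-memory stand-in for a log file; the contents are invented for this sketch.
io = StringIO.new("#Fields: date time\nGET /a.aspx\nGET /b.aspx\nGET /c.aspx\n")

seen = []
io.each_line do |line|
  seen << line.chomp
  # Stop after the first non-comment line; later lines are never handed to the block.
  break unless line.start_with?('#')
end

# Whatever each_line had not consumed yet is still waiting in the IO object.
remaining = io.read
```

Because the block breaks out early, only the lines up to that point are consumed, and the rest can still be read afterwards.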
The difference between the two methods is that `readlines` has to append every line to an array. When the file is large, Ruby may have to reallocate and copy the underlying C-level array one or more times to enlarge it.
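One rough way to see the array being enlarged in steps is to watch its reported size while pushing. This is only a sketch: it assumes MRI (CRuby), where `ObjectSpace.memsize_of` from the `objspace` extension is available, and the exact numbers depend on the Ruby version. The names `lines`, `sizes` and `growth_steps` are illustrative:

```ruby
require 'objspace'

lines = []
sizes = []
10_000.times do |i|
  lines << "line #{i}"
  # Bytes used by the array object itself, as reported by MRI.
  sizes << ObjectSpace.memsize_of(lines)
end

# Number of distinct sizes seen while pushing: each change is a reallocation.
growth_steps = sizes.uniq.size
puts "10000 pushes, #{growth_steps} distinct array sizes"
```

The array grows in a relatively small number of jumps rather than on every push, and each jump is where the existing contents may have to be copied.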
Digging into the source, `readlines` is implemented by `io_s_readlines`, which calls `rb_io_readlines`. `rb_io_readlines` calls `rb_io_getline_1` to fetch each line and `rb_ary_push` to push the result onto the array it returns.
`each_line` is implemented by `rb_io_each_line`, which calls `rb_io_getline_1` to fetch each line, just like `readlines`, and yields the line to your block with `rb_yield`.
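The two C loops can be approximated in plain Ruby. This is a sketch of the idea, not the actual implementation: `readlines_sketch` and `each_line_sketch` are hypothetical names, and `IO#gets` stands in for `rb_io_getline_1`:

```ruby
require 'stringio'

# Hypothetical Ruby-level analogue of rb_io_readlines: collect every line.
def readlines_sketch(io)
  lines = []                 # the array that has to grow as the file is read
  while (line = io.gets)     # plays the role of rb_io_getline_1
    lines << line            # plays the role of rb_ary_push
  end
  lines
end

# Hypothetical Ruby-level analogue of rb_io_each_line: hand each line to the block.
def each_line_sketch(io)
  while (line = io.gets)     # same line-fetching step as above
    yield line               # plays the role of rb_yield
  end
  io
end

sample = "a\nb\nc\n"
collected = readlines_sketch(StringIO.new(sample))
yielded = []
each_line_sketch(StringIO.new(sample)) { |l| yielded << l }
```

Both methods fetch lines the same way; the only structural difference is whether each line is pushed onto an array or passed straight to the caller.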
So with `each_line` there is no need to store the lines in a growing array, and therefore no array resizing or copying.
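If you want to measure this yourself, something along these lines could be used. It is a rough sketch: the temporary file, its contents and the line count are arbitrary choices, and timings will vary with machine, Ruby version and file size:

```ruby
require 'benchmark'
require 'tempfile'

# Build a throwaway file; the line count and contents are arbitrary.
tmp = Tempfile.new(['sample', '.log'])
100_000.times { |i| tmp.puts("2012-01-01 00:00:00 GET /page#{i}.aspx 200") }
tmp.close

readlines_count = 0
each_line_count = 0

# Reads everything into an array first, then iterates over it.
readlines_time = Benchmark.realtime do
  File.readlines(tmp.path).each { |_line| readlines_count += 1 }
end

# Streams the file one line at a time.
each_line_time = Benchmark.realtime do
  File.open(tmp.path) { |f| f.each_line { |_line| each_line_count += 1 } }
end

puts format('readlines: %.3fs, each_line: %.3fs', readlines_time, each_line_time)
tmp.unlink
```

Run it several times and on a file of realistic size before drawing conclusions, since a single run can be skewed by disk caching.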
Upvotes: 24