Reputation: 3520
I want to parse a large log file (about 500 MB). If this isn't the right tool for the job, please let me know.
I have a log file with its contents structured like this. Each section can have extra key-value pairs:
requestID: saldksadk
time: 92389389
action: foobarr
----------------------
requestID: 2393029
time: 92389389
action: helloworld
source: email
----------------------
requestID: skjflkjasf3
time: 92389389
userAgent: mobile browser
----------------------
requestID: gdfgfdsdf
time: 92389389
action: randoms
I was wondering if there is an easy way to handle each section's data in the log. A section can span multiple lines, so I can't just split the string. For example, is there an easy way to do something like this:
for(section in log){
// handle section contents
}
Upvotes: 2
Views: 268
Reputation: 12251
Using icktoofay's idea, and by using a custom record separator, I got this:
require 'yaml'
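File.open("path/to/file") do |f|
  # Use the dashed line as the record separator, so each chunk is one section.
  f.each_line("\n----------------------\n") do |record|
    # Collapse the trailing dashed line to "---" so the chunk is valid YAML.
    puts YAML.load(record.sub(/^-{3,}$/, "---")).inspect
  end
end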
The output:
{"requestID"=>"saldksadk", "time"=>92389389, "action"=>"foobarr"}
{"requestID"=>2393029, "time"=>92389389, "action"=>"helloworld", "source"=>"email"}
{"requestID"=>"skjflkjasf3", "time"=>92389389, "userAgent"=>"mobile browser"}
{"requestID"=>"gdfgfdsdf", "time"=>92389389, "action"=>"randoms"}
Upvotes: 5
Reputation: 62648
You can read through the file line-by-line. For each line, we'll check if it's a record separator or a key: value pair. If the former, we'll add the current record to the record list. If the latter, we'll add the k:v pair to the current record.
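records = []
record = {}
# File.foreach streams the file line by line and closes it when done.
File.foreach("data.txt") do |line|
  if line.start_with?("-")
    # Separator line: store the finished record and start a new one.
    records << record unless record.empty?
    record = {}
  else
    k, v = line.split(":", 2).map(&:strip)
    record[k] = v
  end
end
# The last record has no trailing separator.
records << record unless record.empty?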
This produces something like:
[{"requestID"=>"saldksadk", "time"=>"92389389", "action"=>"foobarr"},
{"requestID"=>"2393029", "time"=>"92389389", "action"=>"helloworld", "source"=>"email"},
{"requestID"=>"skjflkjasf3", "time"=>"92389389", "userAgent"=>"mobile browser"},
{"requestID"=>"gdfgfdsdf", "time"=>"92389389", "action"=>"randoms"}]
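With a file of roughly 500 MB, it may be preferable to handle each record as soon as it is complete rather than collecting them all in an array. Here is a minimal sketch of that variation, assuming the same separator and `key: value` layout as the sample. `each_record` is a hypothetical helper name, not something from the answer above, and it accepts any object that responds to `each_line`.

```ruby
# Hypothetical helper: yields one Hash per section instead of building a list.
def each_record(io)
  record = {}
  io.each_line do |line|
    line = line.chomp
    if line.start_with?("-")
      # Separator line: hand the finished record to the caller.
      yield record unless record.empty?
      record = {}
    elsif line.include?(":")
      key, value = line.split(":", 2).map(&:strip)
      record[key] = value
    end
    # Lines without a colon (for example blank lines) are skipped.
  end
  # The last section has no trailing separator.
  yield record unless record.empty?
end

# Usage, assuming the log is stored in "data.txt":
# File.open("data.txt") { |f| each_record(f) { |rec| p rec } }
```

This keeps only one record in memory at a time, which is close to the `for(section in log)` loop the question asks for.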
Upvotes: 3
Reputation: 28554
A very basic way to do it that keeps things simple and efficient:
blocks = []
current_block = {}
sep_range = 0..3
sep_value = "----"
split_pattern = /:\s*/
File.open("filename.txt", 'r') do |f|
  f.each_line do |line|
    if line[sep_range] == sep_value
      blocks << current_block unless current_block.empty?
      current_block = {}
    else
      # chomp drops the trailing newline so it does not end up in the value
      key, value = line.chomp.split(split_pattern, 2)
      current_block[key] = value
    end
  end
end
blocks << current_block unless current_block.empty?
One key point is that we avoid creating unnecessary duplicate objects inside the loop (the range, the test string, and the split regex pattern) by defining them once before the loop begins. This saves a little time and memory, which could add up on a 500 MB file.
Upvotes: 1
Reputation: 160551
I saved your sample text to a file called "test.txt". Opening it with:
File.foreach('test.txt').slice_before(/^---/).to_a
returns:
[
["requestID: saldksadk\n", "time: 92389389\n", "action: foobarr\n"],
["----------------------\n", "requestID: 2393029\n", "time: 92389389\n", "action: helloworld\n", "source: email\n"],
["----------------------\n", "requestID: skjflkjasf3\n", "time: 92389389\n", "userAgent: mobile browser\n"],
["----------------------\n", "requestID: gdfgfdsdf\n", "time: 92389389\n", "action: randoms\n"]
]
By running each sub-array through a filter we can strip off the leading "---":
blocks = File.foreach('test.txt').slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]
  ary.map(&:chomp)
}
After running that, blocks is:
[
["requestID: saldksadk", "time: 92389389", "action: foobarr"],
["requestID: 2393029", "time: 92389389", "action: helloworld", "source: email"],
["requestID: skjflkjasf3", "time: 92389389", "userAgent: mobile browser"],
["requestID: gdfgfdsdf", "time: 92389389", "action: randoms"]
]
A bit more tweaking:
blocks = File.foreach('test.txt').slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]
  # Split on the first colon only, and drop the whitespace that follows it.
  Hash[ary.map{ |s| s.chomp.split(/:\s*/, 2) }]
}
and blocks will be:
[
{"requestID"=>"saldksadk", "time"=>"92389389", "action"=>"foobarr"},
{"requestID"=>"2393029", "time"=>"92389389", "action"=>"helloworld", "source"=>"email"},
{"requestID"=>"skjflkjasf3", "time"=>"92389389", "userAgent"=>"mobile browser"},
{"requestID"=>"gdfgfdsdf", "time"=>"92389389", "action"=>"randoms"}
]
Upvotes: 3
Reputation: 129011
That looks like YAML, although it is not exactly YAML. (YAML separates documents with exactly three dashes, no more.) You might try to mangle your document somehow such that lines consisting of only hyphens are collapsed into three hyphens so it is valid YAML. After that, you can feed it into a YAML parser.
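A minimal sketch of that idea, assuming the text being parsed fits in memory as one string and that every separator line consists only of hyphens. `normalize_separators` is a hypothetical helper name. `YAML.load_stream` is the standard-library call that parses a multi-document YAML string into an array with one element per document.

```ruby
require 'yaml'

# Hypothetical helper: collapse any line made only of hyphens (three or more)
# into the three-hyphen YAML document separator.
def normalize_separators(text)
  text.gsub(/^-{3,}[ \t]*$/, "---")
end

sample = <<~LOG
  requestID: saldksadk
  time: 92389389
  action: foobarr
  ----------------------
  requestID: skjflkjasf3
  time: 92389389
  userAgent: mobile browser
LOG

# Each YAML document becomes one Hash.
records = YAML.load_stream(normalize_separators(sample))
records.each { |record| p record }
```

Note that the YAML parser converts numeric-looking values such as `time` into integers. For the full 500 MB file, reading one section at a time is likely gentler on memory than loading the whole string at once.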
Upvotes: 4