Reputation: 3253
I have some raw data I scraped from a log file, which currently reads as:
" 80: 0.20%: 2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3",
" 71: 0.16%: 14/Jan/14 12:18: /site/podcasts/audio/2013/podcast-11-04-2013.mp3",
" 67: 0.17%: 2/Jan/14 23:44: /site/podcasts/audio/podcast-3-21-2011.mp3",
" 67: 0.15%: 15/Jan/14 09:25: /site/podcasts/audio/2013/podcast-08-05-2013.mp3",
" 64: 0.12%: 2/Jan/14 07:40: /site/podcasts/audio/2013/podcast-11-04-2013-1.mp3",
I need to gather three pieces of information from each line and turn them into data for an Excel spreadsheet -- the number before the initial colon, the date, and the URL. So if I converted it into CSV, it would read as
80, 2/Jan/14, /site/podcasts/audio/2013/podcast-07-15-2013.mp3
71, 14/Jan/14, /site/podcasts/audio/2013/podcast-11-04-2013.mp3
67, 2/Jan/14, /site/podcasts/audio/podcast-3-21-2011.mp3
And so on. However, I'm having trouble figuring out how to do that. I wrote some regexes to capture the right data, but I'm not sure how to convert those regexes into what I need.
There's this regex to get the first number: ^"\s{3}(\d+)
And this regex could get the date: (\d+\/\w{3}\/14)
And this regex could get the URL: (\/site\/podcasts\/audio\/.*\.mp3)
However, I'm not sure how to take these regexes and convert them into the CSV I need. Any ideas?
Upvotes: 0
Views: 1601
Reputation: 18351
I personally wouldn't use regular expressions:
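output = ''
File.open("path/to/log", "r") do |f|
  f.each_line do |line|
    # drop the scraped quotes and trailing comma, then split on runs of whitespace
    # (split with no argument also ignores leading whitespace)
    num, _percent, date, _time, url = line.delete('",').split
    next if url.nil? # skip blank or malformed lines
    num = num.chomp(':') # removes the colon from the end of the number
    output << "#{num}, #{date}, #{url}\n"
  end
end
# do whatever you want with the result
puts output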
And this prints:
80, 2/Jan/14, /site/podcasts/audio/2013/podcast-07-15-2013.mp3
71, 14/Jan/14, /site/podcasts/audio/2013/podcast-11-04-2013.mp3
67, 2/Jan/14, /site/podcasts/audio/podcast-3-21-2011.mp3
67, 15/Jan/14, /site/podcasts/audio/2013/podcast-08-05-2013.mp3
64, 2/Jan/14, /site/podcasts/audio/2013/podcast-11-04-2013-1.mp3
There are shorter, more clever ways to do this, but I like this way because it's readable and clear.
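Since the goal is a file Excel can open, the same split-based approach can hand the fields to Ruby's standard csv library, which takes care of quoting if a field ever contains a comma. This is only a sketch: the helper name log_lines_to_csv and the sample array are made up for illustration, and it assumes every line has the five whitespace-separated fields shown in the question.

```ruby
require "csv"

# Hypothetical helper: turns raw log lines into CSV text.
def log_lines_to_csv(lines)
  CSV.generate do |csv|
    lines.each do |line|
      # Remove the quotes and commas left over from scraping, then split on whitespace.
      num, _percent, date, _time, url = line.delete('",').split
      next if url.nil? # skip blank or malformed lines
      csv << [num.chomp(":"), date, url]
    end
  end
end

# Two sample lines copied from the question.
sample = [
  '" 80: 0.20%: 2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3",',
  '" 71: 0.16%: 14/Jan/14 12:18: /site/podcasts/audio/2013/podcast-11-04-2013.mp3",'
]
puts log_lines_to_csv(sample)
```

Note that the csv library writes fields without a space after each comma, which is what Excel expects anyway. To write a file instead of a string, CSV.open("out.csv", "w") accepts rows the same way.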
Upvotes: 1
Reputation: 67968
\s+(\d+):\s+.*?(\d+\/\w+\/\d+)\s+.*?(\/.*?)\".*
Try this. Please look at the demo.
http://regex101.com/r/cA4wE0/10
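For what it's worth, here is roughly how that pattern could be applied from Ruby, joining the three capture groups into one CSV-style line. The constant LOG_LINE and the helper extract_fields are names invented for this sketch. Note that the pattern relies on the closing double quote, so it assumes the lines still carry the quotes shown in the question.

```ruby
# The pattern from this answer, written as a Ruby regex literal.
LOG_LINE = /\s+(\d+):\s+.*?(\d+\/\w+\/\d+)\s+.*?(\/.*?)".*/

# Hypothetical helper: returns "num, date, url", or nil when the line does not match.
def extract_fields(line)
  match = LOG_LINE.match(line)
  match && match.captures.join(", ")
end

line = '" 80: 0.20%: 2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3",'
puts extract_fields(line)
```

Lines without a closing quote return nil here, so a caller would want to skip or log those rather than write them out.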
Upvotes: 1
Reputation: 4659
This combines your three patterns into a single expression with three capture groups that you can then handle in Ruby. I'm unfamiliar with Ruby, but I imagine you can concatenate the strings that the capture groups return.
^"\s{3}(\d+)(?:[\s:]|\d\.\d\d%)*(\d+\/\w{3}\/14)[\s\d:]*(\/site\/podcasts\/audio\/.*\.mp3)
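In Ruby, the concatenation could look something like the sketch below. String#scan returns one array of captures per match, and because ^ matches at the start of every line in Ruby, the whole log text can be scanned in one go. The names PODCAST_ROW and rows_from_text are invented for this example, and the sample text assumes each line has exactly three spaces after the opening quote, as the \s{3} in the pattern expects; if your data is padded differently, that part would need relaxing (for instance to \s*).

```ruby
# The combined pattern from this answer, as a Ruby regex literal.
PODCAST_ROW = /^"\s{3}(\d+)(?:[\s:]|\d\.\d\d%)*(\d+\/\w{3}\/14)[\s\d:]*(\/site\/podcasts\/audio\/.*\.mp3)/

# Hypothetical helper: returns one "num, date, url" string per matching line.
def rows_from_text(text)
  text.scan(PODCAST_ROW).map { |captures| captures.join(", ") }
end

# Sample data with the three-space padding assumed above.
sample_text = <<~LOG
  "   80: 0.20%: 2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3",
  "   71: 0.16%: 14/Jan/14 12:18: /site/podcasts/audio/2013/podcast-11-04-2013.mp3",
LOG
puts rows_from_text(sample_text)
```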
Upvotes: 1