Reputation: 438
I have a file with 10m lines, with each line like this:
{ "_id" : ObjectId("567f972cad55ac0797baa773"), "id" : 357103 }
For each line, I need to do something with its "id" value.
So far I have:
listings.each.with_index do |line, idx|
  # listing_id = JSON.parse(line).fetch("id") #=> invalid JSON error
  # line.split('"id : "')                     #=> some gibberish
  line.match(/"id" : (.*)/)[1]                #=> "357103 }"
end
JSON.parse throws an error that the lines are not valid JSON, and split returns gibberish. The closest result to what I want came from match, but for the line above it returns "357103 }" instead of just the number.
Can you please help me fix it?
Upvotes: 1
Views: 180
Reputation: 3628
Splitting is faster than a regex here, and with such a large file it might make a noticeable difference.
Also, it looks like you need to escape those double quotes: line.split("\"id\" : ")
> puts Benchmark.measure{line.split("\"id\" : ").last.delete('}').delete(' ')}
0.000000 0.000000 0.000000 ( 0.000020)
> puts Benchmark.measure{line.match(/\s(\d+)\s/)[1]}
0.000000 0.000000 0.000000 ( 0.000043)
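If you want to reproduce that comparison yourself over many iterations, a self-contained sketch could look like this (the sample line is taken from the question; the iteration count and timings will of course vary by machine):
require 'benchmark'

line = '{ "_id" : ObjectId("567f972cad55ac0797baa773"), "id" : 357103 }'
n = 1_000_000

Benchmark.bm(7) do |x|
  x.report('split') do
    n.times { line.split("\"id\" : ").last.delete('}').delete(' ') }
  end
  x.report('regex') do
    n.times { line.match(/\s(\d+)\s/)[1] }
  end
end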
Update
Even faster, use splitting all the way:
> puts Benchmark.measure{line.split("\"id\" : ").last.split(' ').first }
0.000000 0.000000 0.000000 ( 0.000008)
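Put into the loop from the question, that might look like this (a sketch; it assumes listings yields each line as a string, as in your code):
listings.each.with_index do |line, idx|
  listing_id = line.split("\"id\" : ").last.split(' ').first
  # listing_id is a String such as "357103"; call .to_i if you need an Integer
end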
Edit
Though, as Stefan mentions in his comment, it looks like your file is a MongoDB dump (with BSON ObjectIds) rather than valid JSON. There is a Mongo gem.
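If the data is still in MongoDB rather than only in this text dump, you could read the ids with the mongo gem instead of parsing lines. A rough sketch, where the host, database name, and listings collection are assumptions:
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'mydb')
client[:listings].find({}, projection: { id: 1 }).each do |doc|
  # doc is a BSON::Document; do something with doc['id']
end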
Upvotes: 3
Reputation: 1038
Are the ids all numbers? If so, you can use a regex that looks for "id", a colon, and then a run of digits, and capture the digits:
line.match(/"id" : ([0-9]+)/)[1]
This looks for "id" followed by any number of digits and returns the captured group.
If the ids can contain both letters and numbers, then:
line.match(/"id" : ([[:alnum:]]+)/)[1]
Upvotes: 1
Reputation: 1942
You can use the \s(\d+)\s regex; no JSON parsing is required.
line.match(/\s(\d+)\s/)[1] #=> "357103"
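Applied to the whole file, a sketch could look like this (assuming the dump is in a file called listings.txt; File.foreach streams it line by line, which matters with 10m lines):
File.foreach('listings.txt') do |line|
  id = line[/\s(\d+)\s/, 1]  # capture group 1 as a String, or nil if no match
  next if id.nil?
  # do something with id, e.g. id.to_i
end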
Upvotes: 1