Reputation: 438
I have a file with 10m lines, with each line like this:
{ "_id" : ObjectId("567f972cad55ac0797baa773"), "id" : 357103 }
For each line, I need to do something with its "id" value.
So far I have:
listings.each.with_index do |line, idx|
  # listing_id = JSON.parse(line).fetch("id") #=> invalid JSON error
  # line.split('"id : "')                     #=> some gibberish
  line.match(/"id" : (.*)/)[1]                #=> "357103 }"
end
JSON.parse throws an error that the lines are not valid JSON, and split returns gibberish. The closest result to what I want came from match, but for the line above it returns "357103 }" instead of just the number.
Can you please help me fix it?
Upvotes: 1
Views: 180
Reputation: 3628
Splitting is faster than a regex here, and with such a large file it might make a noticeable difference.
Also, it looks like you need to escape those double quotes: line.split("\"id\" : ")
> puts Benchmark.measure{line.split("\"id\" : ").last.delete('}').delete(' ')}
0.000000 0.000000 0.000000 ( 0.000020)
> puts Benchmark.measure{line.match(/\s(\d+)\s/)[1]}
0.000000 0.000000 0.000000 ( 0.000043)
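If you want to reproduce that comparison yourself over many iterations, a self-contained sketch could look like this (the sample line is taken from the question; the iteration count and timings will of course vary by machine):
require 'benchmark'

line = '{ "_id" : ObjectId("567f972cad55ac0797baa773"), "id" : 357103 }'
n = 1_000_000

Benchmark.bm(7) do |x|
  x.report('split') do
    n.times { line.split("\"id\" : ").last.delete('}').delete(' ') }
  end
  x.report('regex') do
    n.times { line.match(/\s(\d+)\s/)[1] }
  end
end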
Update
Even faster, use splitting all the way:
> puts Benchmark.measure{line.split("\"id\" : ").last.split(' ').first }
0.000000 0.000000 0.000000 ( 0.000008)
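Put into the loop from the question, that might look like this (a sketch; it assumes listings yields each line as a string, as in your code):
listings.each.with_index do |line, idx|
  listing_id = line.split("\"id\" : ").last.split(' ').first
  # listing_id is a String such as "357103"; call .to_i if you need an Integer
end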
Edit
Though, as Stefan mentions in his comment, it looks like your file is a MongoDB dump (with BSON ObjectIds) rather than valid JSON. There is a Mongo gem.
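If the data is still in MongoDB rather than only in this text dump, you could read the ids with the mongo gem instead of parsing lines. A rough sketch, where the host, database name, and listings collection are assumptions:
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'mydb')
client[:listings].find({}, projection: { id: 1 }).each do |doc|
  # doc is a BSON::Document; do something with doc['id']
end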
Upvotes: 3
Reputation: 1038
Are the ids all numbers? If so, you can use a regex that looks for "id", a colon, and then a run of digits, and capture the digits:
line.match(/"id" : ([0-9]+)/)[1]
This looks for "id" followed by any number of digits and returns the captured group.
If the ids can contain both letters and numbers, then:
line.match(/"id" : ([[:alnum:]]+)/)[1]
Upvotes: 1
Reputation: 1942
You can use the \s(\d+)\s regex; no JSON parsing is required.
line.match(/\s(\d+)\s/)[1] #=> "357103"
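Applied to the whole file, a sketch could look like this (assuming the dump is in a file called listings.txt; File.foreach streams it line by line, which matters with 10m lines):
File.foreach('listings.txt') do |line|
  id = line[/\s(\d+)\s/, 1]  # capture group 1 as a String, or nil if no match
  next if id.nil?
  # do something with id, e.g. id.to_i
end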
Upvotes: 1