JDRomano2

Reputation: 517

Extract only the second match to a Regular Expression in Ruby using standard library

I'm writing a Ruby script that goes through a text file and pulls the second occurrence of a regular expression pattern from each line. Here is one example line of text:

gi|324021737|ref|NM_001204301.1|    gi|324021738|ref|NP_001191230.1|    100.00  459 0   0   1080    2456    294 752 0.0  905

The value I'm trying to get is gi|324021738 in the example above, not the gi|324021737 at the beginning of the line. These values always begin with gi|, but the number of digits that follows varies.

What would be the most efficient way to append only the second match of the regex to an array of strings?

Upvotes: 2

Views: 4836

Answers (3)

Martin Velez

Reputation: 1429

It took me a few seconds to understand the regex posted by @Rohit.

Here is an alternative answer using split. Split the line on whitespace with split(" ") and take the element at index 1, which is the second column. Then split that element on "|" and take the element at index 1 again. That is the number you are looking for.

s = "gi|324021737|ref|NM_001204301.1|    gi|324021738|ref|NP_001191230.1|    100.00  459 0   0   1080    2456    294 752 0.0  905"
s.split(" ")[1].split("|")[1]

=> "324021738"
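To apply the same idea to every line of a file, something like the following might work. This is only a sketch: the method name second_gi_number and the sample lines are made up for illustration, and in a real script the lines would more likely come from File.foreach. It also skips lines that have no second column instead of raising on nil.

```ruby
# Hypothetical sample input; a real script would read these with File.foreach(path).
lines = [
  "gi|324021737|ref|NM_001204301.1|    gi|324021738|ref|NP_001191230.1|    100.00  459 0   0   1080    2456    294 752 0.0  905",
  "gi|12|ref|A.1|    gi|55|ref|B.2|    99.00  10",
  "incomplete"
]

# Hypothetical helper: returns the number after "gi|" in the second
# whitespace-separated column, or nil if the line does not have that shape.
def second_gi_number(line)
  second_field = line.split(" ")[1]   # split(" ") splits on runs of whitespace
  return nil unless second_field

  parts = second_field.split("|")
  parts[0] == "gi" ? parts[1] : nil
end

# Collect one number per well-formed line; compact drops the nils.
numbers = lines.map { |line| second_gi_number(line) }.compact
```

Note that this returns only the digits. If the "gi|" prefix is wanted as well, it can be prepended or the first two parts rejoined with "|".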

Upvotes: 0

the Tin Man

Reputation: 160551

This would be better processed using split('|') than a regex:

array = []

text = 'gi|324021737|ref|NM_001204301.1|    gi|324021738|ref|NP_001191230.1|    100.00  459 0   0   1080    2456    294 752 0.0  905'
array << text.split('|')[4, 2].map(&:lstrip)
=> [["gi", "324021738"]]

Pipes ("|") are often used to delimit fields in a database output, similar to a comma-separated value file (CSV).

Ruby's CSV is an even better choice:

require 'csv'

text = 'gi|324021737|ref|NM_001204301.1|    gi|324021738|ref|NP_001191230.1|    100.00  459 0   0   1080    2456    294 752 0.0  905'

array = []
CSV.parse(text, :col_sep => '|') do |row|
  array << row[4, 2].map(&:lstrip)
end

array
=> [["gi", "324021738"]]

CSV can be better than splitting, and especially better than a simple regex, because a delimited file will often quote or escape the delimiter when it appears inside a field. A regex that handles that case is very difficult to write and maintain, and split can do the wrong thing too, which is why it's better to rely on a pre-built, pre-tested "wheel" like CSV.
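To illustrate the difference, here is a small sketch with a made-up record (not taken from the question's data) whose second field contains literal pipes and is therefore quoted. It assumes the file uses CSV-style double-quote quoting, which may not be true of every pipe-delimited format.

```ruby
require 'csv'

# Hypothetical record: the middle field contains pipes, so it is quoted.
record = 'gi|"id|with|pipes"|ref'

# A plain split cuts the quoted field apart and keeps the quote characters.
naive = record.split('|')
# naive is ["gi", "\"id", "with", "pipes\"", "ref"]

# CSV honors the quoting and returns the field intact.
parsed = CSV.parse_line(record, col_sep: '|')
# parsed is ["gi", "id|with|pipes", "ref"]
```

For the sample line in the question no field is quoted, so split and CSV give the same result there.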

Upvotes: 2

Rohit Jain

Reputation: 213253

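You can use this regex (written here as a Ruby regex literal, since \| and \d would lose their backslashes inside a double-quoted Ruby string):

/^gi.*?(gi\|\d+).*?$/

Then take capture group 1 from the match.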
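In Ruby, using that pattern could look roughly like the sketch below. The variable names are illustrative. The trailing .*?$ is dropped because it does not affect what group 1 captures. String#scan is shown as a different, swapped-in technique that simply collects every gi|digits match and takes the second one.

```ruby
line = "gi|324021737|ref|NM_001204301.1|    gi|324021738|ref|NP_001191230.1|    100.00  459 0   0   1080    2456    294 752 0.0  905"

second_matches = []

# Capture-group approach: group 1 is the first "gi|digits" found after the leading "gi".
match = line.match(/^gi.*?(gi\|\d+)/)
second_matches << match[1] if match

# Alternative using scan: collect all matches, then index the second one.
# This is nil when the line has fewer than two matches.
via_scan = line.scan(/gi\|\d+/)[1]
```

Both approaches give "gi|324021738" for the example line. Inside a loop over the file, the `if match` guard keeps lines without a second occurrence from raising on nil.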

Upvotes: 2
