Reputation: 517
I'm writing a Ruby script to go through a text file and pull the second occurrence of a regular expression pattern in each line. Here is an example line of text:
gi|324021737|ref|NM_001204301.1| gi|324021738|ref|NP_001191230.1| 100.00 459 0 0 1080 2456 294 752 0.0 905
The number I'm trying to get is the one in gi|324021738 in the above example, not the gi|324021737 that comes at the beginning of the line. These values always begin with gi|, but the number of digits following them varies.
What would be the most efficient way to append only the second match of the regex to an array of strings?
Upvotes: 2
Views: 4836
Reputation: 1429
It took me a few seconds to understand the regex posted by @Rohit.
Here is an alternative answer using split. Split the string on the space character (" "), then split the element at index 1 on "|". The element at index 1 of that result is the number you are looking for.
s = "gi|324021737|ref|NM_001204301.1| gi|324021738|ref|NP_001191230.1| 100.00 459 0 0 1080 2456 294 752 0.0 905"
s.split(" ")[1].split("|")[1]
=> "324021738"
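To apply the same idea to every line of a file, something like the following sketch might work. The helper name second_gi_numbers and the file name hits.txt are hypothetical, and it assumes the second space-separated field always looks like gi|<digits>|...; lines that don't fit are skipped.

```ruby
# Hypothetical helper (not from the question): collects the number from
# the second "gi|<digits>" field of each line.
# Assumes fields are separated by spaces, as in the sample line.
def second_gi_numbers(lines)
  lines.filter_map do |line|
    field = line.split(" ")[1]                 # second space-separated field
    next unless field&.start_with?("gi|")      # skip lines without a second gi field
    field.split("|")[1]                        # the digits after "gi|"
  end
end

# Possible usage, assuming a file named hits.txt exists:
#   ids = second_gi_numbers(File.foreach("hits.txt"))
```

filter_map needs Ruby 2.7 or later; on older versions, map followed by compact does the same job.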
Upvotes: 0
Reputation: 160551
This would be better processed using split('|') than a regex:
array = []
text = 'gi|324021737|ref|NM_001204301.1| gi|324021738|ref|NP_001191230.1| 100.00 459 0 0 1080 2456 294 752 0.0 905'
array << text.split('|')[4, 2].map(&:lstrip)
=> [["gi", "324021738"]]
Pipes ("|") are often used to delimit fields in a database output, similar to a comma-separated value file (CSV).
Ruby's CSV is an even better choice:
require 'csv'
text = 'gi|324021737|ref|NM_001204301.1| gi|324021738|ref|NP_001191230.1| 100.00 459 0 0 1080 2456 294 752 0.0 905'
array = []
CSV.parse(text, :col_sep => '|') do |row|
array << row[4, 2].map(&:lstrip)
end
array
=> [["gi", "324021738"]]
The reason CSV might be better than splitting, and especially better than a simple regex, is that a delimited file will often escape the delimiting character when it's embedded inside another field. A regex that handles that condition is very difficult to write and maintain. split could do the wrong thing too, which is why it's better to rely on a pre-built, pre-tested "wheel" like CSV.
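As a small illustration of that point, here is a sketch with a made-up record (not taken from the question's data) whose middle field contains the delimiter and is therefore quoted. The helper names naive_fields and csv_fields are hypothetical.

```ruby
require 'csv'

# Plain split treats every "|" as a field boundary, even inside quotes.
def naive_fields(line)
  line.split('|')
end

# CSV honors the quoting, so the embedded "|" stays inside its field.
def csv_fields(line)
  CSV.parse_line(line, col_sep: '|')
end

record = 'a|"b|c"|d'      # made-up record with a quoted, embedded delimiter
naive  = naive_fields(record)
parsed = csv_fields(record)
```

With this record, naive ends up with four pieces because the quoted field is cut in two, while parsed has three fields with "b|c" intact.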
Upvotes: 2
Reputation: 213253
You can use this regex:
/^gi.*?(gi\|\d+).*?$/
Then take capture group 1 from the match.
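In Ruby, that could look roughly like the sketch below. The method names are hypothetical; the second one is a variation that uses scan to collect every gi|<digits> on the line and take the second, which avoids the anchors and lazy quantifiers.

```ruby
# Hypothetical helper: returns capture group 1, i.e. the second "gi|<digits>",
# or nil when the line does not match.
def second_gi_by_group(line)
  match = /^gi.*?(gi\|\d+).*?$/.match(line)
  match && match[1]
end

# Variation: find all "gi|<digits>" occurrences and take the one at index 1.
def second_gi_by_scan(line)
  line.scan(/gi\|\d+/)[1]
end

line = "gi|324021737|ref|NM_001204301.1| gi|324021738|ref|NP_001191230.1| 100.00 459 0 0 1080 2456 294 752 0.0 905"

results = []
id = second_gi_by_group(line)
results << id if id        # append only when a second match exists
```

Both helpers return the value with its gi| prefix; strip it with something like id.split("|")[1] if only the digits are wanted.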
Upvotes: 2