Reputation: 103
I need help with a regex in Ruby that fails, and I have not figured out why. I am using Ruby to grab portions of text from a large bio-database, which has the following structure (I will show just two items for simplicity):
//
ID IPI00303292.1 IPI; PRT; 538 AA.
AC IPI00303292;
DR Superfamily; SSF48371; ARM; 1.
DR UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
DR CleanEx; HS_KPNA1; -; -.
//
ID IPI00301082.1 IPI; PRT; 309 AA.
AC IPI00301082;
DT 06-JUN-2003 (IPI Human rel. 2.20, Created)
//
i.e. database entries start with a line containing the IPI code and end with a double forward slash. I want to retrieve the information associated with specific IPI codes.
Let's say I want to get only the text lines of IPI00303292.1, spanning from the IPI code to the following //.
A Rubular test of the regex /(IPI00303292\.1).*\/\//m grabs the whole displayed text (i.e. both entries): it matches up to the last // and skips the one between the two entries.
Update: Hi, based on your valuable suggestions, I think I am close to getting a usable program for my purposes. The code is:
matches = []
no_matches = []
ipi = File.open('mini_alphaIPI.txt').collect do |var| # read the file containing IPI search codes
  var = var.chomp
  db = File.open('mini_human.dat') # read the file containing IPI data
  db.readlines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|
    db_record.shift
    next if db_record.empty?
    matches.push(db_record) if db_record.first.include?(var)
    if db_record.first.include?(var) then
      matches.push(db_record)
    else
      no_matches.push(var)
    end
  end
end
File.open('out_raw.txt', "wb") do |file|
  matches.each do |z|
    file.puts z
  end
end
The last problem to solve is that each correctly selected hit appears twice in the output file. I cannot get rid of this mistake. Please help.
Upvotes: 0
Views: 512
Reputation: 7725
The regex approach is quite difficult in this case, and I think the problem lies in . also matching /.
Almost achieved it with this regex:
%r{
//\n # Match '//' and new line
(?<item> # Capture the item...
[\n\w\s.,;\-\(\)]+ # And here comes the !"#%&@ł
) # You need this to match a single appearance of '/'
}x # e.g., not '//', and partial regex negation is a bit tricky...
However, it would be much easier to just use split('//') and continue the process from there.
DATA.read.split('//').each do |item|
  item.each_line do |line|
    # etc
  end
end
HOWDY: this works http://rubular.com/r/kH12xUyxR9
%r{
(//)?\n
(?<item>.+?)
\n//
}xm
But this is just for curiosity. Seriously, just use split('//').
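As a rough, self-contained sketch of the split approach: the sample text below is copied from the question, and the names `text`, `entries`, and `entry` are only illustrative. Splitting on a line that consists solely of // (the regex `^//$`) instead of the literal string '//' is my own variation, assumed here so that a // occurring inside a data line would not cut an entry short.

```ruby
# Sample data copied from the question; in real use this would be read from the database file.
text = <<~DB
  //
  ID IPI00303292.1 IPI; PRT; 538 AA.
  AC IPI00303292;
  DR Superfamily; SSF48371; ARM; 1.
  DR UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
  DR CleanEx; HS_KPNA1; -; -.
  //
  ID IPI00301082.1 IPI; PRT; 309 AA.
  AC IPI00301082;
  DT 06-JUN-2003 (IPI Human rel. 2.20, Created)
  //
DB

# Split on lines that contain only '//', then drop the surrounding
# whitespace and any empty fragments left over at the edges.
entries = text.split(%r{^//$}).map(&:strip).reject(&:empty?)

# Pick the entry that mentions the wanted IPI code.
entry = entries.find { |e| e.include?('IPI00303292.1') }
puts entry
```

Each element of `entries` is one record as a single string, so `each_line` can be used on it as in the loop above.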
Upvotes: 0
Reputation: 160551
Ruby comes equipped with slice_before, which is a nice tool for this sort of problem:
require 'pp'
DATA.readlines.slice_before(%r(\A//)).each do |db_record|
  pp db_record
end
__END__
//
ID IPI00303292.1 IPI; PRT; 538 AA.
AC IPI00303292;
DR Superfamily; SSF48371; ARM; 1.
DR UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
DR CleanEx; HS_KPNA1; -; -.
//
ID IPI00301082.1 IPI; PRT; 309 AA.
AC IPI00301082;
DT 06-JUN-2003 (IPI Human rel. 2.20, Created)
//
Running the code outputs:
["//\n", "ID IPI00303292.1 IPI; PRT; 538 AA.\n", "AC IPI00303292;\n", "DR Superfamily; SSF48371; ARM; 1.\n", "DR UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.\n", "DR CleanEx; HS_KPNA1; -; -.\n"]
["//\n", "ID IPI00301082.1 IPI; PRT; 309 AA.\n", "AC IPI00301082;\n", "DT 06-JUN-2003 (IPI Human rel. 2.20, Created)\n"]
["//\n"]
It scans an array, breaking it on the occurrence of lines that match a pattern, which, in this case, is %r(\A//), or, in English, "lines that start with two forward slashes." Each resulting sub-array is one record, beginning with its // delimiter line.
Note that the lines have trailing new-lines. That can be fixed using:
DATA.readlines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|
If you want to skip the leading // entry of each sub-array, use:
pp db_record[1..-1]
or:
db_record.shift
pp db_record
After cleanup, the code looks like:
require 'pp'
DATA.readlines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|
  db_record.shift
  pp db_record
end
And running it looks like:
["ID IPI00303292.1 IPI; PRT; 538 AA.", "AC IPI00303292;", "DR Superfamily; SSF48371; ARM; 1.", "DR UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.", "DR CleanEx; HS_KPNA1; -; -."]
["ID IPI00301082.1 IPI; PRT; 309 AA.", "AC IPI00301082;", "DT 06-JUN-2003 (IPI Human rel. 2.20, Created)"]
[]
Two tweaks and you're done:
DATA.readlines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|
  db_record.shift
  next if db_record.empty?
  pp db_record if db_record.first['IPI00303292.1']
end
Which outputs:
["ID IPI00303292.1 IPI; PRT; 538 AA.", "AC IPI00303292;", "DR Superfamily; SSF48371; ARM; 1.", "DR UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.", "DR CleanEx; HS_KPNA1; -; -."]
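If several IPI codes need to be looked up, as in the update to the question, one possible sketch is below. The names `wanted_codes`, `db_lines`, and `found` are hypothetical stand-ins for the question's two input files, and the second code in the list is made up to show a miss. Note that the code in the update calls `matches.push(db_record)` twice for a hit (once in the one-line `if` modifier and once in the `if`/`else` block), which would explain the duplicated output; here each matching record is pushed once.

```ruby
require 'set'

# Hypothetical search codes; the second one is invented and not in the sample data.
wanted_codes = ['IPI00303292.1', 'IPI99999999.1']

# Sample data from the question, standing in for the database file.
db_lines = <<~DB.lines
  //
  ID IPI00303292.1 IPI; PRT; 538 AA.
  AC IPI00303292;
  DR Superfamily; SSF48371; ARM; 1.
  DR UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
  DR CleanEx; HS_KPNA1; -; -.
  //
  ID IPI00301082.1 IPI; PRT; 309 AA.
  AC IPI00301082;
  DT 06-JUN-2003 (IPI Human rel. 2.20, Created)
  //
DB

matches = []
found = Set.new

# Walk the database once and test each record against every wanted code.
db_lines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|
  db_record.shift                 # drop the leading '//' line
  next if db_record.empty?
  code = wanted_codes.find { |c| db_record.first.include?(c) }
  next unless code
  matches.push(db_record)         # pushed exactly once per matching record
  found << code
end

# Codes that never matched any record.
no_matches = wanted_codes.reject { |c| found.include?(c) }
```

Reading the database once and checking each record against the list also avoids reopening the data file for every search code.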
Upvotes: 1
Reputation: 168081
This is a typical problem caused by using the greedy quantifier *. Use the non-greedy quantifier *? instead.
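A minimal sketch of the difference, using the sample entries from the question (the variable names `text`, `greedy`, and `lazy` are only for illustration):

```ruby
# Sample data copied from the question.
text = <<~DB
  //
  ID IPI00303292.1 IPI; PRT; 538 AA.
  AC IPI00303292;
  DR Superfamily; SSF48371; ARM; 1.
  DR UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
  DR CleanEx; HS_KPNA1; -; -.
  //
  ID IPI00301082.1 IPI; PRT; 309 AA.
  AC IPI00301082;
  DT 06-JUN-2003 (IPI Human rel. 2.20, Created)
  //
DB

# Greedy: .* consumes as much as possible, so the match runs to the LAST '//'.
greedy = text[/IPI00303292\.1.*\/\//m]

# Non-greedy: .*? consumes as little as possible, so the match stops at the FIRST '//'.
lazy = text[/IPI00303292\.1.*?\/\//m]

puts lazy
```

With this data, `greedy` spans both entries while `lazy` ends at the // that closes the first one. If a data line could itself contain //, anchoring the terminator to a whole line (for example `^\/\/$`) may be safer; that is an assumption about the data, not something shown in the question.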
Upvotes: 1