Maurizio Cirilli

Reputation: 103

Double forward slash skipped by a regex in Ruby

I need help with a regex in Ruby that fails, and I have not figured out why. I am using Ruby to grab portions of text from a large bio-database, which has the following structure (I will show just two entries for simplicity):

//
ID   IPI00303292.1         IPI;      PRT;   538 AA.
AC   IPI00303292;
DR   Superfamily; SSF48371; ARM; 1.
DR   UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
DR   CleanEx; HS_KPNA1; -; -.
//
ID   IPI00301082.1         IPI;      PRT;   309 AA.
AC   IPI00301082;
DT   06-JUN-2003 (IPI Human rel. 2.20, Created)
//

i.e. database entries start with a line containing the IPI code and end with a double forward slash. I want to retrieve the information associated with specific IPI codes. Let's say I want to get only the text lines of IPI00303292.1, spanning from the IPI code to the following //.

A Rubular test of the regex /(IPI00303292\.1).*\/\//m grabs the whole displayed text (i.e. both entries): it matches up to the last // and skips the one between the two entries.

Update: Hi, based on your valuable suggestions, I think I am close to getting a usable program for my purposes. The code is:

matches = []
no_matches = []

# read the file containing IPI search codes
ipi = File.open('mini_alphaIPI.txt').collect do |var|
  var = var.chomp

  db = File.open('mini_human.dat') # read the file containing IPI data

  db.readlines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|
    db_record.shift
    next if db_record.empty?

    matches.push(db_record) if db_record.first.include?(var)

    if db_record.first.include?(var)
      matches.push(db_record)
    else
      no_matches.push(var)
    end
  end
end

File.open('out_raw.txt', "wb") do |file|
  matches.each do |z|
    file.puts z
  end
end

The last problem to solve is that I am getting two copies of each correctly selected hit in the output file. I cannot get rid of this mistake. Please help.

Upvotes: 0

Views: 512

Answers (3)

ichigolas

Reputation: 7725

The regex approach is very difficult in this case, and I think the problem lies in . also matching /.

Almost achieved it with this regex:

%r{
  //\n                  # Match '//' and new line
  (?<item>              # Capture the item...
    [\n\w\s.,;\-\(\)]+  # And here comes the !"#%&@ł
  )                     # You need this to match a single appearance of '/' 
}x                      # e.g., not '//', and partial regex negation is a bit tricky... 

However, it would be much easier to just use split('//') and continue the process from there.

# DATA is a File object, so read it into a String before splitting
DATA.read.split('//').each do |item|
  item.each_line do |line|
    # etc
  end
end
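
To make the split idea concrete, here is a minimal sketch, not a definitive implementation. The SAMPLE text and the find_entry helper are hypothetical names introduced only for illustration, and the sketch assumes the ID line is the first line of every record. It splits on lines that consist of exactly //, which is slightly stricter than split('//') and avoids cutting a record if // ever shows up inside a data line.

```ruby
# Hypothetical sample data; in practice this would come from File.read.
SAMPLE = <<~DB
  //
  ID   IPI00303292.1         IPI;      PRT;   538 AA.
  AC   IPI00303292;
  DR   UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
  //
  ID   IPI00301082.1         IPI;      PRT;   309 AA.
  AC   IPI00301082;
  //
DB

# find_entry is a hypothetical helper: it returns the lines of the entry
# whose first (ID) line mentions `code`, or nil when no entry matches.
def find_entry(text, code)
  # Split only on lines that are exactly '//', then drop the empty
  # pieces left before the first and after the last delimiter.
  records = text.split(%r{^//$}).map(&:strip).reject(&:empty?)
  record = records.find { |r| r.lines.first.include?(code) }
  record && record.lines.map(&:chomp)
end

puts find_entry(SAMPLE, 'IPI00303292.1')
```

For a very large database file, reading everything into one string may be too memory-hungry; this sketch only illustrates the splitting step.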

HOWDY: this works http://rubular.com/r/kH12xUyxR9

%r{
  (//)?\n
  (?<item>.+?)
  \n//
}xm

But this is just for curiosity; seriously, just use split('//').

Upvotes: 0

the Tin Man

Reputation: 160551

Ruby comes equipped with slice_before which is a nice tool for this sort of problem:

require 'pp'

DATA.readlines.slice_before(%r(\A//)).each do |db_record|
  pp db_record
end

__END__
//
ID   IPI00303292.1         IPI;      PRT;   538 AA.
AC   IPI00303292;
DR   Superfamily; SSF48371; ARM; 1.
DR   UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
DR   CleanEx; HS_KPNA1; -; -.
//
ID   IPI00301082.1         IPI;      PRT;   309 AA.
AC   IPI00301082;
DT   06-JUN-2003 (IPI Human rel. 2.20, Created)
//

Running the code outputs:

["//\n",
 "ID   IPI00303292.1         IPI;      PRT;   538 AA.\n",
 "AC   IPI00303292;\n",
 "DR   Superfamily; SSF48371; ARM; 1.\n",
 "DR   UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.\n",
 "DR   CleanEx; HS_KPNA1; -; -.\n"]
["//\n",
 "ID   IPI00301082.1         IPI;      PRT;   309 AA.\n",
 "AC   IPI00301082;\n",
 "DT   06-JUN-2003 (IPI Human rel. 2.20, Created)\n"]
["//\n"]

slice_before scans an array and starts a new sub-array at every element that matches a pattern, which, in this case, is %r(\A//), or, in English, "lines that start with two forward slashes." The result is one sub-array per record, each beginning with its // delimiter line.
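
The chunking behaviour is easier to see on a tiny array. This is only an illustrative sketch; chunk_records is a hypothetical helper name and the single-letter elements stand in for real database lines.

```ruby
# chunk_records is a hypothetical wrapper around slice_before.
# slice_before returns an Enumerator; to_a turns it into an array of
# chunks, each one starting at an element that matches the pattern.
def chunk_records(lines)
  lines.slice_before(%r{\A//}).to_a
end

p chunk_records(['//', 'a', 'b', '//', 'c', '//'])
# => [["//", "a", "b"], ["//", "c"], ["//"]]
```

Note the trailing ["//"] chunk: the final delimiter starts a chunk of its own, which is why the real output ends with a lone // entry.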

Note that the lines have trailing newlines. That can be fixed using:

DATA.readlines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|

If you want to skip the leading // entry of each sub-array, use:

pp db_record[1..-1]

or:

db_record.shift
pp db_record

After cleanup, the code looks like:

require 'pp'

DATA.readlines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|
  db_record.shift
  pp db_record
end

And running it looks like:

["ID   IPI00303292.1         IPI;      PRT;   538 AA.",
 "AC   IPI00303292;",
 "DR   Superfamily; SSF48371; ARM; 1.",
 "DR   UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.",
 "DR   CleanEx; HS_KPNA1; -; -."]
["ID   IPI00301082.1         IPI;      PRT;   309 AA.",
 "AC   IPI00301082;",
 "DT   06-JUN-2003 (IPI Human rel. 2.20, Created)"]
[]

Two tweaks and you're done:

DATA.readlines.map(&:chomp).slice_before(%r(\A//)).each do |db_record|
  db_record.shift
  next if db_record.empty?

  pp db_record if db_record.first['IPI00303292.1']

end

Which outputs:

["ID   IPI00303292.1         IPI;      PRT;   538 AA.",
 "AC   IPI00303292;",
 "DR   Superfamily; SSF48371; ARM; 1.",
 "DR   UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.",
 "DR   CleanEx; HS_KPNA1; -; -."]
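
When several IPI codes have to be looked up at once, one possible extension is to index the records first and then test each code a single time, so every hit is stored exactly once. This is a sketch under assumptions, not the only way to do it: lookup_entries and DEMO_DB are hypothetical names, the ID line is assumed to be the first line of each record with the code as its second whitespace-separated field, and codes are matched exactly rather than with include?.

```ruby
# lookup_entries is a hypothetical helper. db_lines is an array of lines
# from the database, codes an array of IPI codes to search for.
# Returns [matching_records, codes_without_a_match].
def lookup_entries(db_lines, codes)
  # Index every record by the code found in its ID line.
  index = {}
  db_lines.map(&:chomp).slice_before(%r{\A//}).each do |record|
    record.shift if record.first.start_with?('//') # drop the delimiter
    next if record.empty?

    id = record.first.split[1] # assumed layout: "ID   <code>   ..."
    index[id] = record
  end

  # Each code is checked once, so no record is pushed twice.
  found, missing = codes.partition { |code| index.key?(code) }
  [found.map { |code| index[code] }, missing]
end

# Hypothetical in-memory data standing in for File.readlines output.
DEMO_DB = [
  '//',
  'ID   IPI00303292.1         IPI;      PRT;   538 AA.',
  'AC   IPI00303292;',
  '//',
  'ID   IPI00301082.1         IPI;      PRT;   309 AA.',
  'AC   IPI00301082;',
  '//'
].freeze

hits, misses = lookup_entries(DEMO_DB, ['IPI00303292.1', 'IPI00000000.0'])
hits.each { |record| puts record }
puts "not found: #{misses.join(', ')}"
```

Building the index once also avoids re-reading the database for every search code.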

Upvotes: 1

sawa

Reputation: 168081

This is a typical problem caused by using the greedy quantifier *. Use the non-greedy quantifier *? instead.
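
A small sketch of the difference, using a shortened copy of the sample data from the question. TEXT, GREEDY and LAZY are names chosen here for illustration. The ^ in front of // in the lazy pattern is an extra safeguard that goes beyond the plain *? fix: it requires the closing // to sit at the start of a line.

```ruby
# Shortened copy of the question's sample data (illustrative only).
TEXT = <<~DB
  //
  ID   IPI00303292.1         IPI;      PRT;   538 AA.
  AC   IPI00303292;
  DR   UniProt/Swiss-Prot; P52294; IMA1_HUMAN; M.
  //
  ID   IPI00301082.1         IPI;      PRT;   309 AA.
  AC   IPI00301082;
  //
DB

# Greedy: .* consumes as much as possible, so the match ends at the LAST //.
GREEDY = %r{IPI00303292\.1.*//}m
# Lazy: .*? consumes as little as possible, so the match ends at the
# FIRST line that starts with //.
LAZY = %r{IPI00303292\.1.*?^//}m

puts TEXT[GREEDY].lines.count # 7 lines: runs on into the second entry
puts TEXT[LAZY].lines.count   # 4 lines: the first entry plus its closing //
```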

Upvotes: 1
