user1192137

Reputation: 69

Perl: using grep to extract a pattern matching substr of a file line

I have the following content in a file:

gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM

For every line in the file I would like to extract the number in the second field (170570902 in the line above) and push it onto an array. I'm trying to grep for this number and extract it from the matched line, but I can't find the right way to do it.

Here's what I have in mind:

while ($sec_gi = <IN_SIDS>){
    $sec_gi =~ s/[0-9]{5,}/$&/;
    print $sec_gi."\n";
}

$& is supposed to be the exact match string. With this I get the matched line EXCEPT the match pattern, which is exactly the opposite of what I want.

Could anyone please help?

Thanks!

Upvotes: 2

Views: 3853

Answers (7)

CaitlinG

Reputation: 2015

Assuming you don't have to use grep, the following short program will work.

Hope this helps.

Caitlin

#!/usr/bin/perl
use strict;
use warnings;

my @array;

# Read one line at a time instead of slurping the whole handle into a list
while ( <DATA> )
{
    push @array, $1 if /gi\|(\d+)\|/;
}

for (@array) {
    print "$_\n";
}

__DATA__
gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|178370902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170593502|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170578993|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170898368|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM

Upvotes: 0

TLP

Reputation: 67900

Looks like split is the simplest solution (ETA optimized):

my @array;
while (<IN_SIDS>) {
    # Split the current line ($_) into at most 3 fields and take the second
    my $nums = (split /\|/, $_, 3)[1];
    print "$nums\n";
    push @array, $nums;
}

I did a benchmark to compare the efficiency to a regex solution:

#!/usr/bin/perl
use strict;
use warnings;

my $data = "gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM";

use Benchmark qw(cmpthese);

cmpthese(shift, {
        'Regex' => \&regex,
        'Split' => \&splitting
    });

sub regex {
    if ($data =~ /^[^|]+\|(\d{5,})\|/) {
        return $1;
    }
}

sub splitting {
    return (split /\|/, $data, 3)[1];
}

The result is a draw:

tlp@ubuntu:~/perl$ perl tx.pl 1000000
           Rate Split Regex
Split 2083333/s    --   -2%
Regex 2127660/s    2%    --

Thanks M42 for advice in comments. I picked the split solution for simplicity and easy maintenance, not performance, but as of now, it is equal to a regex solution.

Upvotes: 5

Kyle

Reputation: 3609

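You need to specify a capture group, and the leading .* has to be non-greedy so that it does not swallow the number you want:

while ($sec_gi = <IN_SIDS>) {
    chomp $sec_gi;
    # .*? stops at the first run of 5 or more digits
    $sec_gi =~ s/^.*?([0-9]{5,}).*$/$1/;
    print $sec_gi."\n";
}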
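When a capture group follows a leading .*, greediness decides which digits end up in $1. A minimal sketch of the difference, using the sample line from the question (the variable names are only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = 'gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM';

# Greedy: .* grabs as much as it can, so the group only gets the
# last five digits of the last long digit run on the line.
my ($greedy) = $line =~ /^.*([0-9]{5,})/;

# Non-greedy: .*? gives up as early as possible, so the group
# starts at the first run of five or more digits.
my ($lazy) = $line =~ /^.*?([0-9]{5,})/;

print "greedy: $greedy\n";   # greedy: 00008
print "lazy:   $lazy\n";     # lazy:   170570902
```

The greedy version picks up the tail of ABLA01000008, which is why the non-greedy form (or an anchor on the | separators) is the safer choice here.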

Upvotes: -1

dlamblin

Reputation: 45351

If the value is always your second field you could use this:

while ($sec_gi = <IN_SIDS>) {
  if ($sec_gi =~ m/^[^|]*\|([^|]+)/) {
    print "$1\n";
  }
}

If the second field is not always the one you want (i.e. you only want it when it is 5 or more digits, as implied), then you could be more specific:

while ($sec_gi = <IN_SIDS>) {
  if ($sec_gi =~ m/^[^|]*\|(\d{5,})/) {
    print "$1\n";
  }
}

If your Perl script is doing ONLY this, you can use cut from GNU coreutils instead (man cut).

Upvotes: 0

David W.

Reputation: 107060

Might as well give you answer #3:

# Declare Array outside the loop
my @my_array;
while ( $sec_gi = <IN_SIDS> ){
    chomp $sec_gi;

    # Test if this field actually exists

    if ( $sec_gi =~ /([0-9]{5,})/ ) {

        # Field exists, push it into your array (or print it)

        push @my_array, $1;
    }
    else {

        # Field doesn't exist: Take appropriate action (which might mean none)

        print "Field not found\n";
    }
}

# Array @my_array has all of your values

yadda, yadda, yadda

By the way, this will locate the number no matter where on the line it occurs. If the number will only ever be in the second field (index 1 after splitting), you want to use split:

my @my_array;
while ( $sec_gi = <IN_SIDS> ) {
    chomp $sec_gi;
    my @sec_gi_array = split /\|/, $sec_gi;
    if ( $sec_gi_array[1] =~ /[0-9]{5,}/ ) {
         push @my_array, $sec_gi_array[1];
    }
    else {
         print "Field not found\n";
    }
}

Upvotes: 0

vmpstr

Reputation: 5211

You can also just match with a capture group and print $1:

$sec_gi =~ /([0-9]{5,})/;

print "$1\n";
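One caveat: if the match fails, $1 still holds whatever the last successful match left in it, so it is worth checking the result. A minimal sketch of one way to do that while also filling an array, as the question asks; extract_ids is a hypothetical helper name and the input lines are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the second |-separated field of every
# line where that field consists of 5 or more digits.
sub extract_ids {
    my @lines = @_;
    # In list context a successful match returns its captures and a
    # failed match returns the empty list, so lines that do not match
    # simply contribute nothing to the result.
    return map { /^[^|]*\|(\d{5,})\|/ } @lines;
}

my @ids = extract_ids(
    'gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581',
    'a line without the expected fields',
    'gi|178370902|gb|ABLA01000008.1| 0.457 24 0.581',
);

print "$_\n" for @ids;
```

With a real file, the lines read from the filehandle can be passed in instead of the literal strings.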

Upvotes: 1

xpapad

Reputation: 4456

You can use:

$sec_gi =~ s/.*?\|(\d{5,}).*/$1/;

However if it's always in the 2nd column you can use split:

@lst = split('\|', $sec_gi );
$sec_gi = $lst[1];

Upvotes: 0
