Reputation: 69
I have the following content in a file:
gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
For every line in the file I would like to extract the number in the second |-delimited field (170570902 in the line above) and push it to an array. I'm trying to match this number and extract it from the matched line, but I can't find the right way to do it.
Here's what I have in mind:
while ($sec_gi = <IN_SIDS>){
$sec_gi =~ s/[0-9]{5,}/$&/;
print $sec_gi."\n";
}
$& is supposed to be the exact matched string. With this I get the whole line back rather than just the matched part, which is the opposite of what I want.
Could anyone please help?
Thanks!
Upvotes: 2
Views: 3853
Reputation: 2015
Assuming you don't have to use grep, the following short program will work.
Hope this helps.
Caitlin
#!/usr/bin/perl
use strict;
use warnings;
my @array;
for ( <DATA> )
{
push @array, $1 if /gi\|(\d+)\|/;
}
for (@array) {
print "$_\n";
}
__DATA__
gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|178370902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170593502|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170578993|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170898368|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
Upvotes: 0
Reputation: 67900
Looks like split is the simplest solution (ETA: optimized):
while (<IN_SIDS>) {
my $nums = (split /\|/, $_, 3)[1];
print "$nums\n";
push @array, $nums;
}
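As a rough illustration of what the limit argument does here, the sketch below splits one sample line with a limit of 3. The variable names are only for demonstration; the point is that split stops after the second separator and leaves the rest of the line untouched in the last element.

```perl
use strict;
use warnings;

my $line = "gi|170570902|gb|ABLA01000008.1| 0.457 24 SignalP-noTM";

# With a limit of 3, split stops after the second separator;
# the third element holds the unsplit remainder of the line.
my @fields = split /\|/, $line, 3;

print scalar(@fields), "\n";   # number of fields produced
print "$fields[1]\n";          # the ID in the second field
print "$fields[2]\n";          # everything after the second "|"
```

Without the limit, split would also break up the remainder of the line, which is work the program never uses.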
I did a benchmark to compare the efficiency to a regex solution:
#!/usr/bin/perl
use strict;
use warnings;
my $data = "gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM";
use Benchmark qw(cmpthese);
cmpthese(shift, {
'Regex' => \&regex,
'Split' => \&splitting
});
sub regex {
if ($data =~ /^[^|]+\|(\d{5,})\|/) {
return $1;
}
}
sub splitting {
return (split /\|/, $data, 3)[1];
}
The result is a draw:
tlp@ubuntu:~/perl$ perl tx.pl 1000000
           Rate Split Regex
Split 2083333/s    --   -2%
Regex 2127660/s    2%    --
Thanks to M42 for the advice in the comments. I picked the split solution for simplicity and easy maintenance, not performance, but as of now it performs on par with the regex solution.
Upvotes: 5
Reputation: 3609
You need to specify a capture group:
while ($sec_gi = <IN_SIDS>){
$sec_gi =~ s/^.*?([0-9]{5,}).*$/$1/;
print $sec_gi."\n";
}
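To see why the substitution in the question appears to do nothing, and what a capture group changes, here is a minimal sketch. The sub name extract_id and the shortened sample line are for illustration only, and the pattern assumes the ID is the first run of five or more digits on the line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first run of 5+ digits, or undef if none.
sub extract_id {
    my ($line) = @_;
    # In list context a match returns its captures, so no s/// is needed.
    my ($id) = $line =~ /([0-9]{5,})/;
    return $id;
}

my $line = "gi|170570902|gb|ABLA01000008.1| 0.457 24 SignalP-noTM";

# Replacing a match with itself leaves the string unchanged,
# which is what s/[0-9]{5,}/$&/ does in the question.
(my $copy = $line) =~ s/([0-9]{5,})/$1/;
print "unchanged\n" if $copy eq $line;

print extract_id($line), "\n";
```

Matching and reading the capture leaves the original line intact, which can be handy if other fields are needed later.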
Upvotes: -1
Reputation: 45351
If the value is always your second field you could use this:
while ($sec_gi = <IN_SIDS>) {
if ($sec_gi =~ m/^[^|]*\|([^|]+)/) {
print "$1\n";
}
}
If the second field is not always the value you want (i.e. you only want five or more digits, as implied), then you could be more specific:
while ($sec_gi = <IN_SIDS>) {
if ($sec_gi =~ m/^[^|]*\|(\d{5,})/) {
print "$1\n";
}
}
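The difference between the two patterns shows up when the second field is not a long number. A minimal sketch with shortened input lines (the second line is made up for illustration and is not from the original file):

```perl
use strict;
use warnings;

my @lines = (
    "gi|170570902|gb|ABLA01000008.1| 0.457",
    "gi|n/a|gb|ABLA01000008.1| 0.457",      # hypothetical malformed line
);

my (@any_field, @digits_only);
for my $line (@lines) {
    # Accepts whatever is in the second field.
    push @any_field, $1 if $line =~ m/^[^|]*\|([^|]+)/;
    # Accepts the second field only if it starts with 5 or more digits.
    push @digits_only, $1 if $line =~ m/^[^|]*\|(\d{5,})/;
}

print "any field:   @any_field\n";
print "digits only: @digits_only\n";
```

The first pattern keeps "n/a" as a result, while the stricter one skips that line entirely.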
If your Perl script is doing ONLY this, you can use the GNU coreutils cut command instead (man cut).
Upvotes: 0
Reputation: 107060
Might as well give you answer #3:
# Declare Array outside the loop
my @my_array;
while ( $sec_gi = <IN_SIDS> ){
chomp $sec_gi;
# Test if this field actually exists
if ( $sec_gi =~ /([0-9]{5,})/ ) {
# Field exists, push it into your array (or print it)
push @my_array, $1;
}
else {
# Field doesn't exist: Take appropriate action (which might mean none)
print "Field not found\n";
}
}
# Array @my_array has all of your values
# ... rest of your program
By the way, this will locate the number no matter where on the line it occurs. If the number will only ever be in field #1 (counting from zero, i.e. the second |-separated field), you want to use split:
my @my_array;
while ( $sec_gi = <IN_SIDS> ) {
chomp $sec_gi;
my @sec_gi_array = split /\|/, $sec_gi;
if ( $sec_gi_array[1] =~ /[0-9]{5,}/ ) {
push @my_array, $sec_gi_array[1];
}
else {
print "Field not found\n";
}
}
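To illustrate how the two approaches above can disagree, here is a small sketch. The input line is invented so that the first field also contains a long digit run; nothing in the original data looks like this.

```perl
use strict;
use warnings;

# Hypothetical line whose first field already contains a long digit run.
my $line = "ref123456|170570902|gb|ABLA01000008.1| 0.457";

# Unanchored: first run of 5+ digits anywhere on the line.
my ($anywhere) = $line =~ /([0-9]{5,})/;

# Split: only the second "|"-separated field is considered.
my $second = (split /\|/, $line)[1];

print "$anywhere\n";
print "$second\n";
```

If the file format is fixed, the split version is the safer of the two because it cannot pick up digits from another column.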
Upvotes: 0
Reputation: 4456
You can use:
$sec_gi =~ s/.*?\|(\d{5,}).*/$1/;
However if it's always in the 2nd column you can use split:
@lst = split('\|', $sec_gi );
$sec_gi = $lst[1];
Upvotes: 0