Reputation: 43
I have a question about parsing a vector of strings like this:
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"
I want to get:
"M92R R236G"
"G98K"
"M34* G87K M389L"
When I use
while ($info1=~s/^(.*)\:(([A-Z\*]){1}([\d]+)([A-Z\*]){1})\,//)
{
$pos=$2;
}
$pos only gives me the last match for each row, that is:
"R236G"
"G98K"
"M389L"
How should I correct the script?
Upvotes: 3
Views: 2984
Reputation: 126722
The reason your code isn't working is that you have a greedy ^(.*) at the start of the regular expression. That will take up as much of the target string as possible as long as the rest of the pattern matches, so you will find only the last occurrence of the substring. You can fix it by changing it to the non-greedy pattern ^(.*?).
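To see the difference between the greedy and non-greedy prefix, here is a minimal sketch. The helper first_code and its $lazy flag are hypothetical names introduced only for this illustration, and the sample row is one of the question's strings with the surrounding quotes removed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the code captured by the first successful
# match, using either a greedy or a non-greedy leading wildcard.
sub first_code {
    my ($str, $lazy) = @_;
    my $re = $lazy
        ? qr/^(.*?):([A-Z*]\d+[A-Z*]),/   # non-greedy: stops at the first colon that works
        : qr/^(.*):([A-Z*]\d+[A-Z*]),/;   # greedy: backtracks from the end of the string
    return $str =~ $re ? $2 : undef;
}

my $row = 'chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,';
print "greedy: ", first_code($row, 0), "\n";   # prints "greedy: R236G"
print "lazy:   ", first_code($row, 1), "\n";   # prints "lazy:   M92R"
```

The greedy form skips straight to the last code in the row, which is why the original loop only ever saw one value per string.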
A few other notes on your regular expression:
There is no need to escape : or , anywhere in a regular expression, and no need to escape * when it is inside a character class [...]
There is never a need for the quantifier {1}, as that is the effect of a pattern without a quantifier
There is no need to put \d inside a character class [\d], as it works fine on its own
There is no need to enclose subpatterns in parentheses unless you need access to whatever substring matched that subpattern when the match succeeds. So, for instance, ^.* is fine without the parentheses
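The notes above can be checked with a short sketch that compares the verbose pattern with a simplified one. The sub name same_match is hypothetical, and both patterns use the non-greedy prefix here so that only the other simplifications are being compared.

```perl
use strict;
use warnings;

# Verbose form: escaped punctuation, {1} quantifiers, [\d], extra groups.
my $verbose = qr/^(.*?)\:(([A-Z\*]){1}([\d]+)([A-Z\*]){1})\,/;
# Simplified form: a single capture group and no redundant syntax.
my $concise = qr/^.*?:([A-Z*]\d+[A-Z*]),/;

# Hypothetical helper: true when both patterns capture the same code.
sub same_match {
    my ($str) = @_;
    my ($v) = ($str =~ $verbose)[1];   # the code is in the second group
    my ($c) = $str =~ $concise;        # the code is in the only group
    return defined $v && defined $c && $v eq $c;
}

for my $row (
    'chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,',
    'chr1-247951785-G-.:G98K,',
    'chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,',
) {
    print same_match($row) ? "same\n" : "different\n";   # prints "same" three times
}
```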
This modification of your code applies that fix along with the simplifications above, and is much more concise:
while ($info1 =~ s/^.*?:([A-Z*]\d+[A-Z*]),// ) {
my $pos = $1;
...
}
But the best solution is to use a global match that finds all occurrences of a pattern within a string, and doesn't need to modify the string in the process.
This program does what you describe. It looks for every run of uppercase letters, digits, or asterisks that follows a colon in each record.
use strict;
use warnings;
while (<DATA>) {
my @fields = /:([A-Z0-9*]+)/g;
print "@fields\n";
}
__DATA__
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"
output
M92R R236G
G98K
M34* G87K M389L
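As a rough illustration of the point about not modifying the string, the following sketch runs both approaches on the same row. The sub names codes_by_match and codes_by_substitution are hypothetical, and the second one takes a reference purely so the caller can observe what is left of the string afterwards.

```perl
use strict;
use warnings;

# List-context global match: returns every capture; $line is only read.
sub codes_by_match {
    my ($line) = @_;
    my @codes = $line =~ /:([A-Z0-9*]+)/g;
    return @codes;
}

# Substitution loop: deletes each matched prefix, consuming the string.
sub codes_by_substitution {
    my ($line_ref) = @_;
    my @codes;
    while ($$line_ref =~ s/^.*?:([A-Z*]\d+[A-Z*]),//) {
        push @codes, $1;
    }
    return @codes;
}

my $row  = 'chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,';
my $copy = $row;
print join(' ', codes_by_match($row)), "\n";            # prints "M34* G87K M389L"
print join(' ', codes_by_substitution(\$copy)), "\n";   # prints "M34* G87K M389L"
print "left after substitution: '$copy'\n";             # prints "left after substitution: ''"
```

Both give the same codes, but only the global match leaves the original record available for further processing.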
Upvotes: 2
Reputation: 185053
Using a one-liner:
$ perl -ne 'print q/"/ . join(" ", m/:([^,]+),/g) . qq/"\n/' file
"M92R R236G"
"G98K"
"M34* G87K M389L"
In a script (the -MO=Deparse option prints the code Perl generates for the one-liner):
$ perl -MO=Deparse -ne 'print "\042" . join(" ", m/:([^,]+),/g) . "\042\n"' file
script:
LINE: while (defined($_ = <ARGV>)) {
print '"' . join(' ', /:([^,]+),/g) . qq["\n];
}
Upvotes: 2
Reputation: 36262
You can use a regex that matches a colon followed by some alphanumeric characters, save the matches in an array, and print them at the end of each loop iteration. Here is an example:
#!/usr/bin/env perl
use strict;
use warnings;
my (@data);
while ( <DATA> ) {
while ( m/:([[:alnum:]*]+)/g ) {
push @data, $1;
}
printf qq|"%s"\n|, join q| |, @data;
undef @data;
}
__DATA__
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"
Run it like:
perl script.pl
That yields:
"M92R R236G"
"G98K"
"M34* G87K M389L"
Upvotes: 0