user2917442

Reputation: 43

Perl while loop when parsing a string

I have a question about parsing a vector of strings like this:

"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"

I want to get:

"M92R R236G"
"G98K"
"M34* G87K M389L"

When I use

while ($info1=~s/^(.*)\:(([A-Z\*]){1}([\d]+)([A-Z\*]){1})\,//) 
{
    $pos=$2; 
}

the result in $pos only gives me the last one for each row, that is:

"R236G"
"G98K"
"M389L"

How should I correct the script?

Upvotes: 3

Views: 2984

Answers (3)

Borodin

Reputation: 126722

The reason your code isn't working is that you have a greedy ^(.*) at the start of the regular expression. That will take up as much of the target string as possible as long as the rest of the pattern matches, so the first substitution consumes everything up to and including the last occurrence of the substring, and that is the only one you find. You can fix it by just changing it to a non-greedy pattern ^(.*?).
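
To see the difference concretely, here is a minimal sketch that runs the same substitution loop once with a greedy prefix and once with a non-greedy one, and collects what each captures. The helper name collect_captures is hypothetical and exists only for this illustration; the row is the first sample from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: repeatedly strip a match of $re from a copy of
# $str and collect the first capture group from each pass.
sub collect_captures {
    my ($str, $re) = @_;
    my @found;
    while ($str =~ s/$re//) {
        push @found, $1;
    }
    return @found;
}

my $row = 'chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,';

# Greedy .* runs to the last colon, so one pass consumes the whole row.
my @greedy = collect_captures($row, qr/^.*:([A-Z*]\d+[A-Z*]),/);

# Non-greedy .*? stops at the first colon that lets the rest match.
my @lazy = collect_captures($row, qr/^.*?:([A-Z*]\d+[A-Z*]),/);

print "greedy: @greedy\n";   # greedy: R236G
print "lazy:   @lazy\n";     # lazy:   M92R R236G
```

The helper works on a copy of the string, so the caller's $row is left untouched even though s/// is destructive.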

A few other notes on your regular expression:

  • There is no need to escape : or , anywhere, or * when it is inside a character class [...]

  • There is never a need for the quantifier {1} as that is the effect of a pattern without a quantifier

  • There is no need to put \d inside a character class [\d], as it works fine on its own

  • There is no need to enclose subpatterns in parentheses unless you need access to whatever substring matched that subpattern when the match succeeds. So, for instance ^.* is fine without the parentheses
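
As a quick check of these notes, the sketch below compares the capture pattern from the question with a simplified version on a few sample fields. The variable names and the helper same_captures are made up for this illustration, and the last test string is an invented non-matching input.

```perl
use strict;
use warnings;

# The pattern from the question, with its redundant escapes, {1}
# quantifiers, [\d] class and extra capture groups.
my $verbose = qr/\:(([A-Z\*]){1}([\d]+)([A-Z\*]){1})\,/;

# The same pattern with the redundancy removed.
my $concise = qr/:([A-Z*]\d+[A-Z*]),/;

# Hypothetical helper: true if both patterns capture the same text in
# their first group for the given string (or both fail to match).
sub same_captures {
    my ($str) = @_;
    my ($first)  = $str =~ $verbose;
    my ($second) = $str =~ $concise;
    return (defined $first ? $first : '') eq (defined $second ? $second : '');
}

for my $field ('chr1-247751935-G-.:M92R,', 'chr13-86597895-S-78:M34*,', 'no match here') {
    printf "%-28s %s\n", $field, same_captures($field) ? 'same' : 'different';
}
```

Each sample field should be reported as "same", since the removed escapes, quantifiers and groups do not change what the first capture group matches.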

This modification of your code uses the same loop as yours, with the non-greedy fix applied, but is very much more concise:

while ($info1 =~ s/^.*?:([A-Z*]\d+[A-Z*]),// ) {
  my $pos = $1;
  ...
}

But the best solution is to use a global match that finds all occurrences of a pattern within a string, and doesn't need to modify the string in the process.

This program does what you describe. It just looks for all the alphanumeric or asterisk strings that follow a colon in each record.

use strict;
use warnings;

while (<DATA>) {
  my @fields = /:([A-Z0-9*]+)/g;
  print "@fields\n";
}

__DATA__
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"

output

M92R R236G
G98K
M34* G87K M389L

Upvotes: 2

Gilles Quénot

Reputation: 185053

Using a one-liner:

$ perl -ne 'print q/"/ . join(" ", m/:([^,]+),/g) . qq/"\n/' file
"M92R R236G"
"G98K"
"M34* G87K M389L"

As a script, generated from the one-liner with B::Deparse:

$ perl -MO=Deparse -ne 'print "\042" . join(" ", m/:([^,]+),/g) . "\042\n"' file

Resulting script:

LINE: while (defined($_ = <ARGV>)) {
    print '"' . join(' ', /:([^,]+),/g) . qq["\n];
}

Upvotes: 2

Birei

Reputation: 36262

You can use a regex that matches a colon followed by some alphanumeric characters (or asterisks), save the captures in an array, and print them at the end of each loop iteration. Here is an example:

#!/usr/bin/env perl

use strict;
use warnings;

my (@data);

while ( <DATA> ) { 
    while ( m/:([[:alnum:]*]+)/g ) { 
        push @data, $1; 
    }   
    printf qq|"%s"\n|, join q| |, @data;
    undef @data;
}

__DATA__
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"

Run it like:

perl script.pl

That yields:

"M92R R236G"
"G98K"
"M34* G87K M389L"

Upvotes: 0
