Michael

Reputation: 84

Perl - Output From Regular Expression Match Acts Very Strange, Indeed

I'm using Perl and regular expressions to parse entries in a (poorly) formatted input text file. My code stores the contents of the input file in $genes, and I've defined a regex with capture groups to store the interesting bits in three variables: $number, $name, and $sequence (see the Script.pl snippet below).

This all works perfectly until I attempt to print out the value of $sequence. I'm attempting to add quotes around the values, and my output looks something like this:

Number: '132'
Name: 'rps12 AmtrCp046'
'equence: 'ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATACCTGGAATCTCACAAA

Number: '134'
Name: 'psbA AmtrCp001'
'equence: 'ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA

Note the missing S in Sequence which has been replaced with a single quote, and note that the sequence itself doesn't have quotes around it as I had intended. I can't figure out why the print statement for $sequence is behaving so strangely. I suspect something is wrong with my regular expression, but I haven't the slightest idea what that could be. Any help would be greatly appreciated!

Script.pl snippet

while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/g) {
   # Get the value of the first capture group in the matched string (the first bit of stuff in parentheses)
   # ([0-9]+)
   $number = $1;

   # Get the value of the fourth capture group
   # ([A-Za-z0-9]*\s+[A-Za-z0-9]+)
   $name = $4;

   # Get the value of the fifth capture group
   # ([ACGT]+\s)
   $sequence = $5;

   print "Number: \'" . $number . "\'\n";
   print "Name: \'" . $name . "\'\n";
   print "sequence: \'" . $sequence . "\'\n";
   print "\n";
}

Input file snippet

>132 gnl|Ambtr|rps12 AmtrCp046
ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATACCTGGAATCTCACAAA
AATCTGAATTTTTAGAAATTGTTCATTCAATTAATTTCAAATAACATATTCGTGGAATACGATTCACTTT
CAAGATGCCTTGATGGTGAAATGGTAGACACGCGAGACTCAAAATCTCGTGCTAAAGAGCGTGGAGGTTC
GAGTCCTCTTCAAGGCATTGAGAATGCTCATTGAATGAGCAATTCAATAACAGAAACAGATCTCGGATCT
AATCGATATTGGCAAGTTTCATACGAAGTATTCCGGCGATCCCCACGATCCGAGTCCGAGCTGTTGTTTG
ATTTAGTTATTCAGTTAAACCA

>134          gnl|Ambtr|psbA AmtrCp001
ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA
TTGATGGGATCCGTGAACCTGTTTCTGGTTCTCTACTTTATGGAAACAATATTCTTTCTGGTGCCATTAT
TCCAACCTCTGCAGCTATAGGTTTGCATTTTTACCCAATATGGGAAGCGGCATCCGTTGATGAATGGTTA
TACAATGGTGGTCCTTATGAGTTAATTGTCCTACACTTCTTACTTAGTGTAGCTTGTTACATGGGTCGTG
AGTGGGAACTTAGTTTCCGTCTGGGTATGCGCCCTTGGATTGCTGTTGCATATTCAGCTCCTGTTGCAGC
TGCTACTGCTGTTTTCTTGATCTACCCTATTGGTCAAGGAAGTTTCTCAGATGGTATGCCTCTAGGAATA
TCTGGTATTTTCAACTTGATGATTGTATTCCAGGCGGAGCACAACATCCTTATGCACCCATTTCACATGT
TAGGCGTAGCTGGTGTATTCGGCGGCTCCCTATTCAGTGCTATGCATGGTTCCTTGGTAACCTCTAGTTT
GATCAGGGAAACCACTGAAAATGAGTCTGCTAATGCAGGTTACAGATTCGGTCAAGAGGAAGAAACCTAT
AATATCGTAGCTGCTCATGGTTATTTTGGTCGATTGATCTTCCAATATGCTAGTTTCAACAATTCTCGTT
CCTTACATTTCTTCCTAGCTGCTTGGCCCGTAGTAGGTATTTGGTTCACTGCTTTGGGTATTAGCACTAT
GGCTTTCAACCTAAATGGTTTCAATTTCAACCAATCCGTAGTTGACAGTCAAGGTCGTGTCATCAACACT
TGGGCTGATATAATCAACCGTGCTAACCTTGGTATGGAAGTTATGCATGAACGTAATGCTCACAATTTCC
CTCTAGACTTAGCTGCTGTTGAAGCTCCATCTACAAATGGATAA

Upvotes: 1

Views: 87

Answers (2)

Pedro Lobito

Reputation: 99031

  while ($genes =~ m/^.*?([0-9]+).*\|([\w ]+)(.+)$/simg) {

   # Get the value of the first capture group
   $number = $1;

   # Get the value of the second capture group
   $name = $2;

   # Get the value of the third capture group
   # (.+)
   $sequence = $3;

   print "Number: \'" . $number . "\'\n";
   print "Name: \'" . $name . "\'\n";
   print "sequence: \'" . $sequence . "\'\n";
   print "\n";
}

EXPLANATION:

Options: dot matches newline; case insensitive; ^ and $ match at line breaks

Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference number 1 «([0-9]+)»
   Match a single character in the range between “0” and “9” «[0-9]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “|” literally «\|»
Match the regular expression below and capture its match into backreference number 2 «([\w ]+)»
   Match a single character present in the list below «[\w ]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A word character (letters, digits, and underscores) «\w»
      The character “ ” « »
Match the regular expression below and capture its match into backreference number 3 «(.+)»
   Match any single character «.+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert position at the end of a line (at the end of the string or before a line break character) «$»

Upvotes: 1

choroba

Reputation: 242333

It seems the input file uses CR+LF line endings. Because \s is inside the capturing parentheses, the carriage return (CR) that follows the sequence is stored in $sequence. When the value is printed, the CR moves the cursor back to the beginning of the line, and the closing quote is then printed there, overwriting the "S" in "Sequence".
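
One way to check this diagnosis is to make the invisible characters visible before printing. The following is a minimal sketch, not the asker's actual script: the sample string in $genes is a hypothetical, shortened record with CR+LF endings, and @shown and $visible are names introduced here for illustration only. It uses the regex from the question unchanged.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample record with CR+LF line endings (an assumption
# about what the real input file contains).
my $genes = ">132 gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCAATGACC\r\nAATCTGAATTTTTAGA\r\n";

my @shown;    # illustrative: collects the escaped captures
while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/g) {
    my $sequence = $5;

    # Replace CR and LF with readable escapes so they show up in the output
    # instead of moving the cursor.
    (my $visible = $sequence) =~ s/\r/\\r/g;
    $visible =~ s/\n/\\n/g;

    push @shown, $visible;
    print "Sequence: '${visible}'\n";
}
```

If the captured value ends in a literal \r when printed this way, the trailing whitespace matched by \s inside the group is a carriage return.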

Solution: Do not capture the final whitespace in the variable.

$genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g
#                                                                                        ^^^  
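
As a rough sketch of the corrected pattern in use, assuming CR+LF input: the sample data below is a hypothetical, shortened two-record string, and @records is a name introduced here for illustration. As in the original pattern, only the first line of each sequence is captured.

```perl
use strict;
use warnings;

# Hypothetical two-record sample with CR+LF line endings (shortened sequences).
my $genes = ">132 gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCA\r\nAATCTGAATT\r\n"
          . ">134          gnl|Ambtr|psbA AmtrCp001\r\nATGATCCCTA\r\nTTGATGGGAT\r\n";

my @records;    # illustrative: collects [number, name, sequence] triples
# The trailing \s now sits outside the fifth capture group, so the CR is
# matched but not stored.
while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g) {
    my ($number, $name, $sequence) = ($1, $4, $5);
    push @records, [$number, $name, $sequence];

    print "Number: '${number}'\n";
    print "Name: '${name}'\n";
    print "Sequence: '${sequence}'\n\n";
}
```

An alternative with the same effect is to strip carriage returns from the input before matching, for example with $genes =~ s/\r//g, so that no later code ever sees them.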

Upvotes: 2
