Reputation: 84
I'm using Perl and regular expressions to parse entries in a (poorly) formatted input text file. My code stores the contents of the input file in $genes, and I've defined a regex with capture groups to store the interesting bits in three variables: $number, $name, and $sequence (see Script.pl snippet below).
This all works perfectly until I attempt to print out the value of $sequence. I'm attempting to add quotes around the values, and my output looks something like this:
Number: '132'
Name: 'rps12 AmtrCp046'
'equence: 'ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATACCTGGAATCTCACAAA
Number: '134'
Name: 'psbA AmtrCp001'
'equence: 'ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA
Note the missing S in Sequence which has been replaced with a single quote, and note that the sequence itself doesn't have quotes around it as I had intended. I can't figure out why the print statement for $sequence is behaving so strangely. I suspect something is wrong with my regular expression, but I haven't the slightest idea what that could be. Any help would be greatly appreciated!
Script.pl snippet
while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/g) {
# Get the value of the first capture group in the matched string (the first bit of stuff in parentheses)
# ([0-9]+)
$number = $1;
# Get the value of the fourth capture group
# ([A-Za-z0-9]*\s+[A-Za-z0-9]+)
$name = $4;
# Get the value of the fifth capture group
# ([ACGT]+\s)
$sequence = $5;
print "Number: \'" . $number . "\'\n";
print "Name: \'" . $name . "\'\n";
print "Sequence: \'" . $sequence . "\'\n";
print "\n";
}
Input file snippet
>132 gnl|Ambtr|rps12 AmtrCp046
ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATACCTGGAATCTCACAAA
AATCTGAATTTTTAGAAATTGTTCATTCAATTAATTTCAAATAACATATTCGTGGAATACGATTCACTTT
CAAGATGCCTTGATGGTGAAATGGTAGACACGCGAGACTCAAAATCTCGTGCTAAAGAGCGTGGAGGTTC
GAGTCCTCTTCAAGGCATTGAGAATGCTCATTGAATGAGCAATTCAATAACAGAAACAGATCTCGGATCT
AATCGATATTGGCAAGTTTCATACGAAGTATTCCGGCGATCCCCACGATCCGAGTCCGAGCTGTTGTTTG
ATTTAGTTATTCAGTTAAACCA
>134 gnl|Ambtr|psbA AmtrCp001
ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA
TTGATGGGATCCGTGAACCTGTTTCTGGTTCTCTACTTTATGGAAACAATATTCTTTCTGGTGCCATTAT
TCCAACCTCTGCAGCTATAGGTTTGCATTTTTACCCAATATGGGAAGCGGCATCCGTTGATGAATGGTTA
TACAATGGTGGTCCTTATGAGTTAATTGTCCTACACTTCTTACTTAGTGTAGCTTGTTACATGGGTCGTG
AGTGGGAACTTAGTTTCCGTCTGGGTATGCGCCCTTGGATTGCTGTTGCATATTCAGCTCCTGTTGCAGC
TGCTACTGCTGTTTTCTTGATCTACCCTATTGGTCAAGGAAGTTTCTCAGATGGTATGCCTCTAGGAATA
TCTGGTATTTTCAACTTGATGATTGTATTCCAGGCGGAGCACAACATCCTTATGCACCCATTTCACATGT
TAGGCGTAGCTGGTGTATTCGGCGGCTCCCTATTCAGTGCTATGCATGGTTCCTTGGTAACCTCTAGTTT
GATCAGGGAAACCACTGAAAATGAGTCTGCTAATGCAGGTTACAGATTCGGTCAAGAGGAAGAAACCTAT
AATATCGTAGCTGCTCATGGTTATTTTGGTCGATTGATCTTCCAATATGCTAGTTTCAACAATTCTCGTT
CCTTACATTTCTTCCTAGCTGCTTGGCCCGTAGTAGGTATTTGGTTCACTGCTTTGGGTATTAGCACTAT
GGCTTTCAACCTAAATGGTTTCAATTTCAACCAATCCGTAGTTGACAGTCAAGGTCGTGTCATCAACACT
TGGGCTGATATAATCAACCGTGCTAACCTTGGTATGGAAGTTATGCATGAACGTAATGCTCACAATTTCC
CTCTAGACTTAGCTGCTGTTGAAGCTCCATCTACAAATGGATAA
Upvotes: 1
Views: 87
Reputation: 99031
while ($genes =~ m/^.*?([0-9]+).*\|([\w ]+)(.+)$/simg) {
# Get the value of the first capture group
$number = $1;
# Get the value of the second capture group
$name = $2;
# Get the value of the third capture group
# (.+)
$sequence = $3;
print "Number: \'" . $number . "\'\n";
print "Name: \'" . $name . "\'\n";
print "sequence: \'" . $sequence . "\'\n";
print "\n";
}
EXPLANATION:
Options: dot matches newline; case insensitive; ^ and $ match at line breaks
Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match any single character «.*?»
  Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference number 1 «([0-9]+)»
  Match a single character in the range between “0” and “9” «[0-9]+»
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*»
  Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “|” literally «\|»
Match the regular expression below and capture its match into backreference number 2 «([\w ]+)»
  Match a single character present in the list below «[\w ]+»
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    A word character (letters, digits, and underscores) «\w»
    The character “ ” « »
Match the regular expression below and capture its match into backreference number 3 «(.+)»
  Match any single character «.+»
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert position at the end of a line (at the end of the string or before a line break character) «$»
Upvotes: 1
Reputation: 242333
It seems the input file uses CR+LF line endings. Because \s is inside the capturing parentheses, the CR ends up stored in $sequence. When the value is printed, the CR moves the cursor back to the beginning of the line, and the closing quote is then printed over the "S" in "Sequence".
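One way to check this is to print the captured value with the control characters made visible. The following is only a sketch: the input string is a shortened, hypothetical record with CR+LF line endings standing in for the real file, and the visible helper is an illustrative name, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical shortened record with CR+LF line endings (not the real file).
my $genes = ">132 gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCAATGACC\r\nAATCTGAATTTTTAGA\r\n";

# Illustrative helper: render CR and LF as the two-character escapes \r and \n.
sub visible {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    return $text;
}

my $captured;
# Same pattern as in the question: the trailing \s is inside the last capture group.
if ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/) {
    $captured = $5;
    print "Sequence: '" . visible($captured) . "'\n";
}
```

If the file really does use CR+LF, the printed value ends in \r just before the closing quote.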
Solution: Do not capture the final whitespace in the variable.
$genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g
#                                                                                        ^^^ \s moved outside the capture group
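Put together, the loop with this change might look like the sketch below. The input string is a shortened, hypothetical stand-in for the real file (with CR+LF line endings), and the @records array is only there to make the result easy to inspect. As in the original script, only the first line of each sequence is captured.

```perl
use strict;
use warnings;

# Hypothetical shortened input with CR+LF line endings (not the real file).
my $genes = ">132 gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCAATGACC\r\n"
          . ">134 gnl|Ambtr|psbA AmtrCp001\r\nATGATCCCTACCTTAT\r\n";

my @records;    # illustrative: collects [number, name, sequence] for each match

# The trailing \s sits outside the last capture group, so the CR is matched but not stored.
while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g) {
    my ($number, $name, $sequence) = ($1, $4, $5);
    push @records, [$number, $name, $sequence];

    print "Number: '" . $number . "'\n";
    print "Name: '" . $name . "'\n";
    print "Sequence: '" . $sequence . "'\n";
    print "\n";
}
```

Another option is to normalize the line endings once, right after reading the file, for example with $genes =~ s/\r\n/\n/g; so that no later pattern has to account for the CR.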
Upvotes: 2