Reputation: 13
I have many long files, but I am only interested in part of the information in each one. So far I have code that trims the file and gives me the line containing the information I need, working on one file at a time.
This is the code I am using:
#!/usr/bin/perl
use strict;
use warnings;
my $data;
open FILE, "<$ARGV[0]" or die "cannot open file '$ARGV[0]'!\n\n";
while ($data = <FILE>) {
    chomp $data;
    if ($data =~ m/\<input type="hidden" name="description" value="454read"><input type="hidden" name="format" value="fasta"><input type="submit" name="submitbutton" value="FASTA"/)
    {
        $data =~ s/[^ACTGN]//g;
        print $data;
    }
}
And this is the input I get:
<input type="hidden" name="sequence" value="TTGTTGAGCTCGACGGTCATGACCCAGCTGGAGTCGGCACGGGCACCCGCGCGCTTCTGCCAGACGCCAATGTGGGACTTCTCGGTGTCGAGGC"><input type="hidden" name="name" value="FUY784js_7HL"><input type="hidden" name="description" value="454read"><input type="hidden" name="format" value="fasta"><input type="submit" name="submitbutton" value="FASTA">
From this I only need two parts. The first is the sequence TTGTT....AGGC, which will always consist of uppercase A, T, C, G, or N, although its length may differ in each file. The second is the name, which in this case is FUY784js_7HL and will change every time.
The ideal output should look like this:
FUY784js_7HL
TTGTTGAGCTCGACGGTCATGACCCAGCTGGAGTCGGCACGGGCACCCGCGCGCTTCTGCCAGACGCCAATGTGGGACTTCTCGGTGTCGAGGC
Do you have any idea how I can do this? I have many files like this, and I would appreciate any help getting this to work for multiple files.
Thanks!
Upvotes: 1
Views: 1359
Reputation: 2340
From what has been posted, I think this would return the name and the sequence:
if ($data =~ /name="sequence" value="([ACGTN]*)".*name="name" value="([^"]+)"/) {
    # $1 is the sequence, $2 is the name
    print "$2\n$1\n";
}
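To cover the "multiple files" part of the question, here is a minimal sketch that loops over every file named on the command line and applies a pattern along the lines of the one above (accepting N in the sequence and a name of any length). The sub name extract_record is made up for illustration, and the sketch assumes both hidden inputs sit on the same line, as in the sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (name, sequence) for a line that holds both
# hidden inputs, or an empty list when the line does not match.
sub extract_record {
    my ($line) = @_;
    return unless $line =~ /name="sequence" value="([ACGTN]*)".*name="name" value="([^"]+)"/;
    return ($2, $1);
}

# Process every file given on the command line, one after another.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "cannot open file '$file': $!\n";
    while (my $line = <$fh>) {
        # Skip lines that do not contain a sequence/name pair.
        my ($name, $seq) = extract_record($line) or next;
        print "$name\n$seq\n";
    }
    close $fh;
}
```

Saved under a name of your choosing, say extract.pl, it could be run as perl extract.pl file1.html file2.html, printing each name followed by its sequence.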
Upvotes: 0
Reputation: 31451
perl -pe 's/[^ACTGN]//g;'
As a proxy for the bit which appears to be problematic, the above command seems to work, at least with the input line starting with <input and the second output line.
If you don't have any other prints in your real program, I'm not sure how it could produce the line you said it did.
Actually, that was a lie. I got:
TTGTTGAGCTCGACGGTCATGACCCAGCTGGAGTCGGCACGGGCACCCGCGCGCTTCTGCCAGACGCCAATGTGGGACTTCTCGGTGTCGAGGCATA
back because of the FASTA value at the end. If you want to restrict to the main value:
perl -pe 's/.*"([ACTGN]+)".*<input\b[^>]*\bname="name"\s[^>]*\bvalue="([^"]+)".*/$2\n$1/;'
Please note that all of the standard disclaimers about the stupidity and fragility of parsing XML with a regex apply. Specifically, it is perfectly legal to reorder the name and value attributes and this example regex doesn't allow that.
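If the attribute order is a concern, one possible workaround is to collect the attributes of each <input> tag into a hash first and then look values up by name. This is still regex-based and only a sketch: the sub name input_values is invented here, and it assumes every attribute value is double-quoted and contains no > character.

```perl
use strict;
use warnings;

# Hypothetical helper: maps each <input> tag's name attribute to its value
# attribute, regardless of the order in which the attributes appear.
sub input_values {
    my ($html) = @_;
    my %value_of;
    while ($html =~ /<input\b([^>]*)>/g) {
        my $attrs = $1;                              # copy before matching again
        my %attr  = $attrs =~ /([\w-]+)="([^"]*)"/g; # attribute => value pairs
        $value_of{ $attr{name} } = $attr{value}
            if defined $attr{name} && defined $attr{value};
    }
    return %value_of;
}

# Read lines from the files on the command line (or standard input).
while (my $line = <>) {
    my %v = input_values($line);
    print "$v{name}\n$v{sequence}\n"
        if defined $v{name} && defined $v{sequence};
}
```

A real HTML parser would be more robust than this, but the hash lookup at least removes the dependence on attribute order.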
Upvotes: 1
Reputation: 193
If I understand the problem correctly, it looks like capturing groups address your need. Especially since you know the beginning and the end but don't know the middle, something like this should work:
$data =~ /TTGTT(.+)AGGC/;
print $1;
Check out the section on capture groups on perldoc: http://perldoc.perl.org/perlre.html#Regular-Expressions
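Since the question says the sequence itself changes from file to file, a capture group can also be anchored on the surrounding attribute text rather than on the sequence's own letters. Here is a minimal sketch using named captures (available from Perl 5.10); the sub name extract_named and the shortened sample sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, sequence) using named capture groups,
# or an empty list when the line does not match.
sub extract_named {
    my ($data) = @_;
    return unless $data =~ /name="sequence" value="(?<seq>[ACGTN]+)".*?name="name" value="(?<name>[^"]+)"/;
    return ($+{name}, $+{seq});
}

# Shortened sample line, not the full sequence from the question.
my $sample = '<input type="hidden" name="sequence" value="TTGTTAGGC">'
           . '<input type="hidden" name="name" value="FUY784js_7HL">';

my ($name, $seq) = extract_named($sample);
print "$name\n$seq\n" if defined $name;
```

Named groups are read back through the %+ hash, which avoids having to keep track of which numbered variable holds which piece.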
Upvotes: 0