Reputation: 651
I have a file with some writing in the first few lines, followed by some tabular output. I want to parse the first line and then skip to the tabular output, but I am having some trouble (even though it sounds simple). My strategy is to find the table's header line and then parse the lines after it.
Example input file:
Query [VOG0001]|NC_002014-NP_040572.1| 1296..1562 + 88 aa|G V protein
Match_columns 100
No_of_seqs 7 out of 16
Neff 2.6
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 d1gvpa_ b.40.4.7 (A:) Gene V p 100.0 1.6E-38 1.4E-43 221.5 0.0 87 2-89 1-87 (87)
2 d1gvpa_ b.40.4.7 (A:) Gene V p 100.0 1.6E-38 1.4E-43 221.5 0.0 87 2-89 1-87 (87)
3 d1gvpa_ b.40.4.7 (A:) Gene V p 100.0 1.6E-38 1.4E-43 221.5 0.0 87 2-89 1-87 (87)
Attempted parsing script:
open (IN, $hhr_report) or die "cannot open $hhr_report\n";
while (my $line=<IN>){
    if ($line =~/^Query/){
        my @query=split(/\|/,$line);
        my $vogL=$query[0];
        my @vogL2=split(/\s+/,$vogL);
        $vog=$vogL2[1];
        $vog=~ s/\[//g;
        $vog=~ s/\]//g;
        print "query_array:\t@query\n";
        print "query_vog:\t$vog\n";
    }
    next until ($line =~/Query HMM/);
    #next if ($line =~/Query HMM/);
    #next until ($line =~/^No\s[0-9]+/);
    print "$line\n";
    my @columns = split(/\s+/,$line);
    ...
}
I'm not sure if I am missing something simple. Right now I only seem to be parsing the header line (the one containing Query HMM), but I want to parse the lines after that.
Any help appreciated.
Upvotes: 0
Views: 238
Reputation: 496
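I think what you are trying to accomplish can be done more simply. I understand you want to parse the first line, skip ahead to the table header, and then process the rows that follow it. In your loop, the `next until` test is re-evaluated for every line, so only the header line itself ever gets past it; a flag that stays set once the header has been seen avoids that.
If so, you could do something like this: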
open (my $fh, '<', $hhr_report) or die "cannot open $hhr_report: $!\n";
# Get the first line of the file and process it:
my $first_line = <$fh>;
my @query=split(/\|/,$first_line);
my $vogL=$query[0];
my @vogL2=split(/\s+/,$vogL);
my $vog=$vogL2[1];
$vog=~ s/\[//g; #/
$vog=~ s/\]//g; #/
print "query_array:\t@query\n";
print "query_vog:\t$vog\n";
# Work on the rest of the file (same handle, so reading continues at line 2):
my $in_table = 0;
while (my $line=<$fh>){
    if ($in_table) {
        # process your columns here
        print "$line\n";
        my @columns = split(/\s+/,$line);
        ... # the rest of your processing
    }
    # read (and throw away) lines until you match the table header:
    $in_table = 1 if $line =~/Query HMM/;
    # next time through the while loop you'll have your
    # first tabular data and the $in_table will be true
}
close $fh;
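As a side note, the split and the two substitutions used to pull out the VOG identifier could probably be collapsed into a single capture. This is only a minimal sketch, assuming the first line always starts with `Query` followed by the identifier in square brackets, as in the sample input; `extract_vog` is a made-up helper name, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bracketed identifier from the Query line,
# or undef if the line does not have the assumed "Query [ID]|..." shape.
sub extract_vog {
    my ($line) = @_;
    my ($vog) = $line =~ /^Query\s+\[([^\]]+)\]/;
    return $vog;
}

# Sample first line copied from the question
my $sample = "Query [VOG0001]|NC_002014-NP_040572.1| 1296..1562 + 88 aa|G V protein\n";
my $vog = extract_vog($sample);
print "query_vog:\t$vog\n" if defined $vog;
```

Checking `defined $vog` before using it keeps a malformed first line from triggering uninitialized-value warnings later on.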
Upvotes: 0
Reputation: 1578
I would try to discard everything up to the header line (or parse the first line), and then begin parsing the lines after the header, like so:
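#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: the report path is passed as the first command-line argument
my $hhr_report = shift @ARGV or die "Usage: $0 <hhr_report>\n";
open (my $fh, "<", $hhr_report) or die "Cannot open $hhr_report: $!";
my $header;
do {
    $header = <$fh>;
    # If you need to parse lines before the header for some reason,
    # do that here
} while ( defined $header && !is_header($header) );
# Stop here if the file ended before a header line was found
die "No table header found in $hhr_report\n" unless defined $header;
# If you like, parse the header column to get the column names
my @lines;
while ( my $line = <$fh> ){
    my @columns = split_line($line);
    push @lines, \@columns;
}
close $fh;

sub is_header {
    my $line = shift;
    return $line =~ /^No\sHit/ ? 1 : 0;
}
sub split_line {
    my $line = shift;
    # Here, use a regex to split the columns, depending on what you need.
    # You could also consider outputting errors if the line is malformatted or missing important values
    # Placeholder only: plain whitespace split (the Hit column itself contains spaces)
    return split ' ', $line;
}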
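The `split_line` stub above is left open on purpose. One possible way to fill it in is sketched below, assuming every row looks like the sample in the question: a hit number, a free-text Hit description that may contain spaces, then the numeric columns, and a template range followed by a length in parentheses. The name `parse_hit_row` and the hash keys are made up for illustration, and real reports may need a looser pattern (for example if columns run together).

```perl
use strict;
use warnings;

# Hypothetical helper: split one table row into named fields.
# Returns a hash reference, or undef if the row does not fit the assumed layout.
sub parse_hit_row {
    my ($line) = @_;
    my @names = qw(no hit prob evalue pvalue score ss cols
                   query_hmm template_hmm template_len);
    my @vals = $line =~ m{
        ^\s*(\d+)         # No
        \s+(.+?)          # Hit: free text, matched lazily
        \s+([\d.]+)       # Prob
        \s+(\S+)          # E-value
        \s+(\S+)          # P-value
        \s+(-?[\d.]+)     # Score
        \s+(-?[\d.]+)     # SS
        \s+(\d+)          # Cols
        \s+(\d+-\d+)      # Query HMM range
        \s+(\d+-\d+)      # Template HMM range
        \s*\((\d+)\)      # template length in parentheses
        \s*\z
    }x;
    return unless @vals;   # malformed row: nothing matched
    my %row;
    @row{@names} = @vals;  # hash slice pairs each name with its capture
    return \%row;
}

# Sample row copied from the question
my $row = parse_hit_row("1 d1gvpa_ b.40.4.7 (A:) Gene V p 100.0 1.6E-38 1.4E-43 221.5 0.0 87 2-89 1-87 (87)\n");
print "$row->{hit}\t$row->{evalue}\n" if $row;
```

Because the Hit text is matched lazily and the rest of the pattern is anchored to the end of the line, the description can contain spaces without shifting the numeric columns, which a plain `split(/\s+/, $line)` cannot guarantee.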
Upvotes: 1