jack79
jack79

Reputation: 123

How can I merge and process multiple rows from a file to produce a report using perl

I am new perl just tried with little messy code.

cat input1.txt

  ##gff-version 2
  ##source-version geneious 5.6.4
  Xm_ABL1 Geneious  CDS 1   168 .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
  Xm_ABL1 Geneious  CDS 169 334 .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
  Xm_ABL1 Geneious  CDS 335 628 .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
  Xm_ABL1 Geneious  CDS 629 901 .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
  Xm_ABL1 Geneious  CDS 902 985 .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
  Xm_ABL1 Geneious  CDS 986 1165    .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
  Xm_ABL1 Geneious  CDS 1166    1350    .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
  Xm_ABL1 Geneious  CDS 1351    1504    .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
  Xm_ABL1 Geneious  BLAST Hit   169 334 .   +   .   
  Xm_ABL1 Geneious  extracted region    1   168 .   +   .   Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="351297 -> 351464"
  Xm_ABL1 Geneious  extracted region    169 334 .   +   .   Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="371785 -> 371950"
  Xm_ABL1 Geneious  extracted region    335 628 .   +   .   Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="372554 -> 372847"
  Xm_ABL1 Geneious  extracted region    629 901 .   +   .   Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="374760 -> 375032"
  Xm_ABL1 Geneious  extracted region    902 985 .   +   .   Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="375230 -> 375313"
  Xm_ABL1 Geneious  extracted region    986 1165    .   +   .   Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="375992 -> 376171"
  Xm_ABL1 Geneious  extracted region    1166    1350    .   +   .   Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="376575 -> 376759"
  Xm_ABL1 Geneious  extracted region    1351    1504    .   +   .   Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="376914 -> 377067"

If input file containg (->) forward arrow.I want output like if($array[7]=~/.*interval=\"\d+ -> \d+\"$/gm){ $array[5]="+"; }

cat output1.txt

gi_371443098_gb_JH556762.1  gene    351297  377067  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 351297  351464  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 371785  371950  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 372554  372847  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 374760  375032  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 375230  375313  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 375992  376171  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 376575  376759  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 376914  377067  .   +   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
###

cat output1.txt If input file containg (<-) reverse arrow. if($array[7]=~/.*interval=\"\d+ <- \d+\"$/gm){ $array[5]="-"; }

gi_371443098_gb_JH556762.1  gene    351297  377067  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 351297  351464  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 371785  371950  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 372554  372847  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 374760  375032  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 375230  375313  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 375992  376171  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 376575  376759  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1  CDS 376914  377067  .   -   .   Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
###

I have tried with little messy code, as I am beginner.

#usr/bin/perl;
use strict;
open(FH,"$ARGV[0]");

while(<FH>){
chomp $_; 
my @array=split("\t");
my $key="$array[2]-$array[0]-$array[1]-$array[2]-$array[3]"; 
if($array[1] eq "CDS"){
$cds_cnt{$key}++;
$cds{$key}="$array[4]\t$array[5]\t$array[6]\t$array[7]";
    }
if($array[1] eq "extracted region"){ 
    (my $pos1,my $pos2)=($array[7]=~/.*interval=\"(\d+) -> (\d+)\"$/gm);
$extract_cnt{$key}++;
$extract{$key}="$pos1\t$pos2";
            }
}

 foreach $i (  sort {$a<=>$b} keys %cds){ 
 my $a=$i; #print "$i\n";
 $a=~s/CDS/extracted region/g;
 if($cds_cnt{$i} == $extract_cnt{$a}){
 #print "$i\t$cds{$i}\n$a\t$extract{$a}\n"; 
 my @array=split /\-/,$i;
 my @pos=split "\t",$extract{$a};
 print "$array[1]\t$array[2]\t$pos[0]\t$pos[1]\t$cds{$i}\n";
   }
  }
 print "###";

Update

What I need to modify in my code

1.To get value from the row of extracted region, (i.e array[7]=/gi|371443098|gb|JH556762.1|/) it can be any value, add underscore to it (i.e gi_371443098_gb_JH556762.1) and print in array[0] in the output1.txt as shown.

2.Add new line as first row while printing (gi_371443098_gb_JH556762.1 gene), in column 3 get starting value of CDS (i.e 351297) and get ending value of CDS in column 4 (i.e 377067) and print as in the first row as shown in ouput1.txt

3.If /extracted region/ block-all rows for.e.g.Extracted interval="351297 -> 351464" (i.e forward arrow) print array[5] as "+" symbol including gene header in the output. if e.g.Extracted interval="351297 <- 351464" (reverse arrow) print array[5] as "-" symbol including gene header in the output.

Upvotes: 1

Views: 305

Answers (1)

RET
RET

Reputation: 9188

It looks like what you're trying to achieve is to merge the details from the line labelled CDS with the matching line labelled extracted region, and then print the merged results with a leading summary header, based on some minimum and maximum values, grouped by Name. Is that correct?

I'm going to assume that what you call $array[0] (Xm_ABL1 Geneious) and $array[2] (169, 335 etc) is sufficient to marry them together, but that's not very clear in your example.

Your first question is just a regexp, which I think you've got the general hang of. I think the issue is how you've gone about capturing your data.

To do the second thing you ask, capture the hi and lo values in your first pass, and store them.

I wasn't planning on writing a complete solution, yet here it is...

use strict;
use warnings;

my $metadata = {}; # hashref to store CDS info in..
my $group = {}; # hashref to store summary/detail in..
my $arrow = { "->" => '+', "<-" => '-' }; # decode arrow to pos/neg

open(FH,"$ARGV[0]");
while(<FH>){
    chomp;
    next if /^#/;
    my @array=split("\t");
    my $key = join(":", $array[0], $array[2]);
    if ($array[1] =~ /CDS/){
        $metadata->{$key} = $array[7];
    }
    if ($array[1] =~ /extracted region/){
        #assert CDS already processed..
        die "No CDS record for $key!\n" unless $metadata->{$key};
        (my $label = $array[7]) =~ s/.*region from (.*)\|;.*/$1/;
        $label =~ s/\|/_/g;
        $group->{$label} ||= { #seed summary if not exists
                pos1 => 1e10,
                pos2 => 0,
                metadata => $metadata->{$key},
                sequences => [],
        };
        (my $pos1, my $arr, my $pos2) = ($array[7]=~/.*interval=\"(\d+) (<?->?) (\d+)\"$/gm);
        # capture hi/lo values for group
        $group->{$label}->{pos1} = $pos1 if $pos1 < $group->{$label}->{pos1};
        $group->{$label}->{pos2} = $pos2 if $pos2 > $group->{$label}->{pos2};
        # push this sequence onto the group's array
        push(@{ $group->{$label}->{sequences} }, [ $pos1, $pos2, $arrow->{$arr} ]);
    }
}
for my $gene (sort keys %{ $group }){
    #write out header
    printf "%s\t%s\t%d\t%d\t.\t%s\t.\t%s\n",
        $gene, 'gene',
        $group->{$gene}->{pos1}, $group->{$gene}->{pos2},
        $group->{$gene}->{sequences}->[0]->[2],
        $group->{$gene}->{metadata};
    foreach my $sequence ( @{ $group->{$gene}->{sequences} } ){
        # write out details
        printf "%s\t%s\t%d\t%d\t.\t%s\t.\t%s\n",
            $gene, 'CDS',
            $sequence->[0], $sequence->[1], $sequence->[2],
            $group->{$gene}->{metadata};
    }
}
print "###\n";

I hope that's sufficiently commented to make sense. Written like this, it will be a lot easier to maintain if you have to come back and modify it after six months has passed.

UPDATE

I've modified this code following the 4th comment. The $array[7] regexp block now captures the value of $array[5] and stores it in the sequence arrayref.

UPDATE #2

  • Note use of $arrow hashref to decode -> and <- to the symbols you require. (line 6)
  • Gene header shows + or - based on value in 3rd field of first sequence. There's an assumption that all sequences for a gene have the same direction. (11 lines from end)

I think we've strayed off the Q&A highway and into the Free Software Development Bureau now. What I've written is not complex code, it has comments and structure. It's time for you to come to grips with the logic of it. And up-vote my answer.

Upvotes: 2

Related Questions