Reputation: 123
I am new perl just tried with little messy code.
cat input1.txt
##gff-version 2
##source-version geneious 5.6.4
Xm_ABL1 Geneious CDS 1 168 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 169 334 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 335 628 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 629 901 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 902 985 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 986 1165 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 1166 1350 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 1351 1504 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious BLAST Hit 169 334 . + .
Xm_ABL1 Geneious extracted region 1 168 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="351297 -> 351464"
Xm_ABL1 Geneious extracted region 169 334 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="371785 -> 371950"
Xm_ABL1 Geneious extracted region 335 628 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="372554 -> 372847"
Xm_ABL1 Geneious extracted region 629 901 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="374760 -> 375032"
Xm_ABL1 Geneious extracted region 902 985 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="375230 -> 375313"
Xm_ABL1 Geneious extracted region 986 1165 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="375992 -> 376171"
Xm_ABL1 Geneious extracted region 1166 1350 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="376575 -> 376759"
Xm_ABL1 Geneious extracted region 1351 1504 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="376914 -> 377067"
If input file containg (->) forward arrow.I want output like if($array[7]=~/.*interval=\"\d+ -> \d+\"$/gm){ $array[5]="+"; }
cat output1.txt
gi_371443098_gb_JH556762.1 gene 351297 377067 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 351297 351464 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 371785 371950 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 372554 372847 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 374760 375032 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 375230 375313 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 375992 376171 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 376575 376759 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 376914 377067 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
###
cat output1.txt If input file containg (<-) reverse arrow. if($array[7]=~/.*interval=\"\d+ <- \d+\"$/gm){ $array[5]="-"; }
gi_371443098_gb_JH556762.1 gene 351297 377067 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 351297 351464 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 371785 371950 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 372554 372847 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 374760 375032 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 375230 375313 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 375992 376171 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 376575 376759 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 376914 377067 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
###
I have tried with little messy code, as I am beginner.
#usr/bin/perl;
use strict;
open(FH,"$ARGV[0]");
while(<FH>){
chomp $_;
my @array=split("\t");
my $key="$array[2]-$array[0]-$array[1]-$array[2]-$array[3]";
if($array[1] eq "CDS"){
$cds_cnt{$key}++;
$cds{$key}="$array[4]\t$array[5]\t$array[6]\t$array[7]";
}
if($array[1] eq "extracted region"){
(my $pos1,my $pos2)=($array[7]=~/.*interval=\"(\d+) -> (\d+)\"$/gm);
$extract_cnt{$key}++;
$extract{$key}="$pos1\t$pos2";
}
}
foreach $i ( sort {$a<=>$b} keys %cds){
my $a=$i; #print "$i\n";
$a=~s/CDS/extracted region/g;
if($cds_cnt{$i} == $extract_cnt{$a}){
#print "$i\t$cds{$i}\n$a\t$extract{$a}\n";
my @array=split /\-/,$i;
my @pos=split "\t",$extract{$a};
print "$array[1]\t$array[2]\t$pos[0]\t$pos[1]\t$cds{$i}\n";
}
}
print "###";
Update
What I need to modify in my code
1.To get value from the row of extracted region, (i.e array[7]=/gi|371443098|gb|JH556762.1|/) it can be any value, add underscore to it (i.e gi_371443098_gb_JH556762.1) and print in array[0] in the output1.txt as shown.
2.Add new line as first row while printing (gi_371443098_gb_JH556762.1 gene), in column 3 get starting value of CDS (i.e 351297) and get ending value of CDS in column 4 (i.e 377067) and print as in the first row as shown in ouput1.txt
3.If /extracted region/ block-all rows for.e.g.Extracted interval="351297 -> 351464" (i.e forward arrow) print array[5] as "+" symbol including gene header in the output. if e.g.Extracted interval="351297 <- 351464" (reverse arrow) print array[5] as "-" symbol including gene header in the output.
Upvotes: 1
Views: 305
Reputation: 9188
It looks like what you're trying to achieve is to merge the details from the line labelled CDS with the matching line labelled extracted region, and then print the merged results with a leading summary header, based on some minimum and maximum values, grouped by Name. Is that correct?
I'm going to assume that what you call $array[0] (Xm_ABL1 Geneious) and $array[2] (169, 335 etc) is sufficient to marry them together, but that's not very clear in your example.
Your first question is just a regexp, which I think you've got the general hang of. I think the issue is how you've gone about capturing your data.
To do the second thing you ask, capture the hi and lo values in your first pass, and store them.
I wasn't planning on writing a complete solution, yet here it is...
use strict;
use warnings;
my $metadata = {}; # hashref to store CDS info in..
my $group = {}; # hashref to store summary/detail in..
my $arrow = { "->" => '+', "<-" => '-' }; # decode arrow to pos/neg
open(FH,"$ARGV[0]");
while(<FH>){
chomp;
next if /^#/;
my @array=split("\t");
my $key = join(":", $array[0], $array[2]);
if ($array[1] =~ /CDS/){
$metadata->{$key} = $array[7];
}
if ($array[1] =~ /extracted region/){
#assert CDS already processed..
die "No CDS record for $key!\n" unless $metadata->{$key};
(my $label = $array[7]) =~ s/.*region from (.*)\|;.*/$1/;
$label =~ s/\|/_/g;
$group->{$label} ||= { #seed summary if not exists
pos1 => 1e10,
pos2 => 0,
metadata => $metadata->{$key},
sequences => [],
};
(my $pos1, my $arr, my $pos2) = ($array[7]=~/.*interval=\"(\d+) (<?->?) (\d+)\"$/gm);
# capture hi/lo values for group
$group->{$label}->{pos1} = $pos1 if $pos1 < $group->{$label}->{pos1};
$group->{$label}->{pos2} = $pos2 if $pos2 > $group->{$label}->{pos2};
# push this sequence onto the group's array
push(@{ $group->{$label}->{sequences} }, [ $pos1, $pos2, $arrow->{$arr} ]);
}
}
for my $gene (sort keys %{ $group }){
#write out header
printf "%s\t%s\t%d\t%d\t.\t%s\t.\t%s\n",
$gene, 'gene',
$group->{$gene}->{pos1}, $group->{$gene}->{pos2},
$group->{$gene}->{sequences}->[0]->[2],
$group->{$gene}->{metadata};
foreach my $sequence ( @{ $group->{$gene}->{sequences} } ){
# write out details
printf "%s\t%s\t%d\t%d\t.\t%s\t.\t%s\n",
$gene, 'CDS',
$sequence->[0], $sequence->[1], $sequence->[2],
$group->{$gene}->{metadata};
}
}
print "###\n";
I hope that's sufficiently commented to make sense. Written like this, it will be a lot easier to maintain if you have to come back and modify it after six months has passed.
UPDATE
I've modified this code following the 4th comment. The $array[7] regexp block now captures the value of $array[5] and stores it in the sequence arrayref.
UPDATE #2
->
and <-
to the symbols you require. (line 6)I think we've strayed off the Q&A highway and into the Free Software Development Bureau now. What I've written is not complex code, it has comments and structure. It's time for you to come to grips with the logic of it. And up-vote my answer.
Upvotes: 2