Reputation: 2157
This is a description of my problem: I have two text files (here $variants
and $annotation
). I want to check if the value from column 2 in $variants
lies between the values from column 2 and 3 in $annotation
. If this is true then the value from column 1 in $annotation
should be added to a new column in $variants
.
This is how my sample input files look like
$annotation
represents a tab-delimited text file
These values can be overlapping and cannot be perfectly sorted, since I'm working with a circular genome
C0 C1 C2
gene1 0 100
gene2 500 1000
gene3 980 1200
gene4 1500 5
$variants
represents a tab-delimited text file
C0 C1
... 5
... 10
... 100
... 540
... 990
The output should look like this ($variants
with two other columns added)
C0 C1 C2 C3
... 5 gene1 gene4
... 10 gene1
... 100 gene1
... 540 gene2
... 990 gene2 gene3
This is how my script looks like for the moment
my %hash1=();
while(<$annotation>){
my @column = split(/\t/); #split on tabs
my $keyfield = $column[1] && $column[2]; # I need to remember values from two columns here. How do I do that?
}
while(<$variants>){
my @column=split(/\t/); # split on tabs
my $keyfield = $column[1];
if ($hash1{$keyfield} >= # so the value in column[1] should be between the values from column[1] & [2] in $annotation
push # if true then add values from column[0] in $annotation to new column in $variants
}
So my biggest problems are how to remember two values in a file using hashes and how to put a value from one file to a column in another file. Could someone help me with this?
Upvotes: 3
Views: 588
Reputation: 241858
If the input files are not large and the positions are not too high, you can use arrays to represent all positions:
#!/usr/bin/perl
use warnings;
use strict;
sub skip_header {
my $FH = shift;
<$FH>;
}
open my $ANN, '<', 'annotation' or die $!;
my $max = 0;
while (<$ANN>) {
$_ > $max and $max = $_ for (split)[1, 2];
}
seek $ANN, 0, 0; # Rewind the file back.
my $circular;
my @genes;
while (<$ANN>) {
my ($gene, $from, $to) = split;
if ($from <= $to) {
$genes[$_] .= "$gene " for $from .. $to;
} else {
$circular = 1;
$genes[$_] .= "$gene " for 0 .. $to, $from .. $max + 1;
}
}
chop @genes;
open my $VAR, '<', 'variants' or die $!;
skip_header($VAR);
while (<$VAR>) {
next if /^\s*#/;
chomp;
my ($str, $pos) = split;
$pos = $#genes if $circular and $pos > $#genes;
print "$_ ", $genes[$pos] // q(), "\n";
}
Upvotes: 1
Reputation: 241858
No hashing needed at all. This example expects the annotations to be sorted and not overlapping, it also works only if all the values from variants should be printed.
#!/usr/bin/perl
use warnings;
use strict;
open my $VAR, '<', 'variants' or die $!;
<$VAR>; # skip header
my ($str, $pos) = split ' ', <$VAR>;
open my $ANN, '<', 'annotation' or die $!;
<$ANN>; # skip header
while (<$ANN>) {
my ($gene, $from, $to) = split;
while ($from <= $pos and $pos <= $to) {
print "$str $pos $gene\n";
($str, $pos) = split ' ', <$VAR> or last;
}
}
Upvotes: 1