Reputation: 6578
I have a dataset of the combined attempts of several different approaches
(approaches 1 to 3
) to identify positions in a genome:
source chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721
attempt1 2L 22220720 2L 22255744
attempt1 3L 15568694 3L 15568866
attempt3 3R 14006279 3R 14008254
attempt1 3R 14006281 3R 14008253
attempt2 3R 14006282 3R 14008254
attempt3 3R 14006286 3R 14008254
attempt1 3R 32060908 3R 32061196
attempt1 3R 32066206 3R 32068392
attempt3 3R 32066206 3R 32068392
attempt2 3R 32066207 3R 32068393
attempt2 X 4574312 X 4576608
attempt1 X 4574313 X 4576607
attempt3 X 4574313 X 4576608
I want to find and group positions that each attempt has identified, allowing a bit of room for error. For example, I would like to class the first two lines...
source chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721
...as a single event (event 1
), that has been identified by two different attempts (attempt1
and attempt2
). I want to class such instances as single events only if different attempts:
bp1
+/-
5 (i.e. within the window 5890200..5890210
)chromosome1
and chromosome2
(2L
)bp2
+/-
5 (i.e. within the window 5890715..5890725
)I have tried to achieve this using a each chromosome and bp as a separate key in a hash
my %SVs;
my $header;
# Make hash
while(<$in>){
chomp;
if ($. == 1){
$header = $_;
next;
}
my ($source, $chromosome1, $bp1, $chromosome2, $bp2) = split;
push @{$SVs{$chromosome1}{$bp1}{$chromosome2}{$bp2}}, $_;
}
}
... and then using a sliding window approach around each bp1 and bp2 value for each line:
my %events;
for my $chr1 ( sort keys %SVs ){
for my $bp1 ( sort { $a <=> $b } keys $SVs{$chr1} ){
my $w1_start = ( $bp1 - 5 );
my $w1_end = ( $bp1 + 5 );
my $window1 = "$w1_start-$w1_end";
for my $chr2 ( sort keys $SVs{$chr1}{$bp1} ){
for my $bp2 ( sort { $a <=> $b } keys $SVs{$chr1}{$bp1}{$chr2} ){
my $w2_start = ( $bp2 - 5 );
my $w2_end = ( $bp2 + 5 );
my $window2 = "$w2_start-$w2_end";
for ( $w1_start .. $w1_end ){
if ($bp1 == $_){
push @{$events{$chr1}{$window1}}, @{$SVs{$chr1}{$bp1}{$chr2}{$bp2}};
}
}
for ( $w2_start .. $w2_end ){
if ($bp2 == $_){
push @{$events{$chr2}{$window2}}, @{$SVs{$chr1}{$bp1}{$chr2}{$bp2}};
}
}
}
}
}
}
print Dumper \%events;
This achieves part of what I'm want, but I can't work out how get my desired output:
event source chromosome1 bp1 chromosome2 bp2
1 attempt1 2L 5890205 2L 5890720
1 attempt2 2L 5890205 2L 5890721
2 attempt1 2L 22220720 2L 22255744
3 attempt1 3L 15568694 3L 15568866
4 attempt3 3R 14006279 3R 14008254
4 attempt1 3R 14006281 3R 14008253
4 attempt2 3R 14006282 3R 14008254
4 attempt3 3R 14006286 3R 14008254
5 attempt1 3R 32060908 3R 32061196
6 attempt1 3R 32066206 3R 32068392
6 attempt3 3R 32066206 3R 32068392
6 attempt2 3R 32066207 3R 32068393
7 attempt2 X 4574312 X 4576608
7 attempt1 X 4574313 X 4576607
7 attempt3 X 4574313 X 4576608
Upvotes: 3
Views: 82
Reputation: 118128
The below defines each equivalence class by the last entry added to the equivalence class (based on my understanding of your comment above):
#!/usr/bin/env perl
use strict;
use warnings;
run(\*DATA);
sub run {
my $fh = shift;
my @header = split ' ', scalar <$fh>;
my @events = ([ get_next_event($fh, \@header)]);
while (my $event = get_next_event($fh, \@header)) {
# change the -1 in the second subscript to 0
# if you want to always compare to the first
# event added to the equivalence class
if (same_event($events[-1][-1], $event, 5)) {
push @{ $events[-1] }, $event;
next;
}
push @events, [ $event ];
}
print join("\t", event => @header), "\n";
for my $i (1 .. @events) {
for my $ev (@{ $events[$i - 1] }) {
print join("\t", $i, @{$ev}{@header}), "\n";
}
}
}
sub get_next_event {
my $fh = shift;
my $header = shift;
return unless defined(my $line = <$fh>);
return unless $line =~ /\S/;
my %event;
@event{ @$header } = split ' ', $line;
return \%event;
}
sub same_event {
my ($x, $y, $threshold) = @_;
return if $x->{chromosome1} ne $y->{chromosome1};
return if abs($x->{bp1} - $y->{bp1}) > $threshold;
return if abs($x->{bp2} - $y->{bp2}) > $threshold;
return 1;
}
__DATA__
source chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721
attempt1 2L 22220720 2L 22255744
attempt1 3L 15568694 3L 15568866
attempt3 3R 14006279 3R 14008254
attempt1 3R 14006281 3R 14008253
attempt2 3R 14006282 3R 14008254
attempt3 3R 14006286 3R 14008254
attempt1 3R 32060908 3R 32061196
attempt1 3R 32066206 3R 32068392
attempt3 3R 32066206 3R 32068392
attempt2 3R 32066207 3R 32068393
attempt2 X 4574312 X 4576608
attempt1 X 4574313 X 4576607
attempt3 X 4574313 X 4576608
Output:
event source chromosome1 bp1 chromosome2 bp2
1 attempt1 2L 5890205 2L 5890720
1 attempt2 2L 5890205 2L 5890721
2 attempt1 2L 22220720 2L 22255744
3 attempt1 3L 15568694 3L 15568866
4 attempt3 3R 14006279 3R 14008254
4 attempt1 3R 14006281 3R 14008253
4 attempt2 3R 14006282 3R 14008254
4 attempt3 3R 14006286 3R 14008254
5 attempt1 3R 32060908 3R 32061196
6 attempt1 3R 32066206 3R 32068392
6 attempt3 3R 32066206 3R 32068392
6 attempt2 3R 32066207 3R 32068393
7 attempt2 X 4574312 X 4576608
7 attempt1 X 4574313 X 4576607
7 attempt3 X 4574313 X 4576608
Upvotes: 4