Reputation: 149
I'm using Perl to generate a list of unique exons (which are the units of genes).
I've generated a file in this format (with hundreds of thousands of lines):
chr1 1000 2000 gene1
chr1 3000 4000 gene2
chr1 5000 6000 gene3
chr1 1000 2000 gene4
Position 1 is the chromosome, position 2 is the starting coordinate of the exon, position 3 is the ending coordinate of the exon, and position 4 is the gene name.
Because genes are often constructed of different arrangements of exons, you have the same exon in multiple genes (see the first and fourth sets). I want to remove these "duplicate" - ie, delete gene1 or gene4 (not important which one gets removed).
I've bashed my head against the wall for hours trying to do what (I think) is a simple task. Could anyone point me in the right direction(s)? I know people often use hashes to remove duplicate elements, but these aren't exactly duplicates (since the gene names are different). It's important that I don't lose the gene name, also. Otherwise this would be simpler.
Here's a totally non-functional loop I've tried. The "exons" array has each line stored as a scalar, hence the subroutine. Don't laugh. I know it doesn't work but at least you can see (I hope) what I'm trying to do:
for (my $i = 0; $i < scalar @exons; $i++) {
my @temp_line = line_splitter($exons[$i]); # runs subroutine turning scalar into array
for (my $j = 0; $j < scalar @exons_dup; $j++) {
my @inner_temp_line = line_splitter($exons_dup[$j]); # runs subroutine turning scalar into array
unless (($temp_line[1] == $inner_temp_line[1]) && # this loop ensures that the the loop
($temp_line[3] eq $inner_temp_line[3])) { # below skips the identical lines
if (($temp_line[1] == $inner_temp_line[1]) && # if the coordinates are the same
($temp_line[2] == $inner_temp_line[2])) { # between the comparisons
splice(@exons, $i, 1); # delete the first one
}
}
}
}
Upvotes: 6
Views: 1639
Reputation: 3754
As simple as it comes. I tried to use as little magic as possible.
my %exoms = ();
my $input;
open( $input, '<', "lines.in" ) or die $!;
while( <$input> )
{
if( $_ =~ /^(\w+\s+){3}(\w+)$/ ) #ignore lines that are not in expected format
{
my @splits = split( /\s+/, $_ ); #split line in $_ on multiple spaces
my $key = $splits[1] . '_' . $splits[2];
if( !exists( $exoms{$key} ) )
{
#could output or write to a new file here, probably output to a file
#for large sets.
$exoms{$key} = \@splits;
}
}
}
#demo to show what was parsed from demo input
while( my ($key, $value) = each(%exoms) )
{
my @splits = @{$value};
foreach my $position (@splits)
{
print( "$position " );
}
print( "\n" );
}
Upvotes: 0
Reputation: 3696
my @exons = (
'chr1 1000 2000 gene1',
'chr1 3000 4000 gene2',
'chr1 5000 6000 gene3',
'chr1 1000 2000 gene4'
);
my %unique_exons = map {
my ($chro, $scoor, $ecoor, $gene) = (split(/\s+/, $_));
"$chro $scoor $ecoor" => $gene
} @exons;
print "$_ $unique_exons{$_} \n" for keys %unique_exons;
This will give you uniqueness, and the last gene name will be included. This results in:
chr1 1000 2000 gene4
chr1 5000 6000 gene3
chr1 3000 4000 gene2
Upvotes: 7
Reputation: 1904
You can use a hash to keep track of duplicates you've already seen and then skip them. This example assumes the fields in your input file are space-delimited:
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
while (my $line = <>) {
my($chromosome, $exon_start, $exon_end, $gene) = split /\s+/, $line;
my $key = join ':', $chromosome, $exon_start, $exon_end;
if ($seen{$key}) {
next;
}
else {
$seen{$key}++;
print $line;
}
}
Upvotes: 1
Reputation: 98388
You can use a hash to dedup en passant, but you need a way to join the parts you want to use to detect duplicates into a single string.
sub extract_dup_check_string {
my $exon = shift;
my @parts = line_splitter($exon);
# modify to suit:
my $dup_check_string = join( ';', @parts[0..2] );
return $dup_check_string;
}
my %seen;
@deduped_exons = grep !$seen{ extract_dup_check_string($_) }++, @exons;
Upvotes: 3