Reputation: 175
I have a tab-delimited file with repeated values in the first column. Each repeated value in the first column corresponds to multiple values in the second column. It looks something like this:
AAAAAAAAAA1 m081216|101|123
AAAAAAAAAA1 m081216|100|1987
AAAAAAAAAA1 m081216|927|463729
BBBBBBBBBB2 m081216|254|260489
BBBBBBBBBB2 m081216|475|1234
BBBBBBBBBB2 m081216|987|240
CCCCCCCCCC3 m081216|433|1000
CCCCCCCCCC3 m081216|902|366
CCCCCCCCCC3 m081216|724|193
For each distinct sequence in the first column, I am trying to write a file containing just the second-column sequences that correspond to it. The name of the file should include the repeated sequence from the first column and the number of second-column sequences that correspond to it. In the above example I would therefore have 3 files of 3 sequences each. The first file would be named something like "AAAAAAAAAA1.3.txt" and look like the following when opened:
m081216|101|123
m081216|100|1987
m081216|927|463729
I have seen other similar questions, but they were answered using a hash. I don't think I can use a hash because I need to keep the number of relationships between columns. Maybe there is a way to use a hash of hashes? I am not sure. Here is my code so far.
use warnings;
use strict;
use List::MoreUtils 'true';
open(IN, "<", "/path/to/in_file") or die $!;
my @array;
my $queryID;
while(<IN>){
chomp;
my $OutputLine = $_;
processOutputLine($OutputLine);
}
sub processOutputLine {
my ($OutputLine) = @_;
my @Columns = split("\t", $OutputLine);
my ($queryID, $target) = @Columns;
push(@array, $target, "\n") unless grep{$queryID eq $_} @array;
my $delineator = "\n";
my $count = true { /$delineator/g } @array;
open(OUT, ">", "/path/to/out_$..$queryID.$count.txt") or die $!;
foreach(@array){
print OUT @array;
}
}
Upvotes: 0
Views: 88
Reputation: 66883
I would still recommend a hash: store all sequences related to the same ID in an anonymous array, which is the value for that ID key. The core of it is really two lines of code.
use warnings;
use strict;
use feature qw(say);
my $filename = 'rep_seqs.txt'; # input file name
open my $in_fh, '<', $filename or die "Can't open $filename: $!";
my %seqs;
foreach my $line (<$in_fh>) {
chomp $line;
my ($id, $seq) = split /\t/, $line;
push @{$seqs{$id}}, $seq;
}
close $in_fh;
my $out_fh;
for (sort keys %seqs) {
my $outfile = $_ . '_' . scalar @{$seqs{$_}} . '.txt';
open $out_fh, '>', $outfile or do {
warn "Can't open $outfile: $!";
next;
};
say $out_fh $_ for @{$seqs{$_}};
}
close $out_fh;
With your input I get the desired files, named AA..._count.txt, with their corresponding three lines each. If items separated by | should be split, you can do that while writing them out, for example.
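As a minimal sketch of splitting the |-separated items at write time: the helper name format_seq and the choice of tab-separated output are assumptions made for illustration, not part of the code above. In the writing loop it would be called as say $out_fh format_seq($_).

```perl
use strict;
use warnings;

# Hypothetical helper: turn one stored value such as "m081216|101|123"
# into a tab-separated line. The '|' has to be escaped in the pattern,
# because an unescaped | means alternation in a regex.
sub format_seq {
    my ($seq) = @_;
    my @fields = split /\|/, $seq;
    return join "\t", @fields;
}

print format_seq('m081216|101|123'), "\n";
```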
Comments
The anonymous array for a key $seqs{$id} is created once we push, if not there already
If there are issues with tabs (converted to spaces?), use ' '. See the comment.
A filehandle is closed and re-opened on every open, so no need to close every time
The default pattern for split is ' ', which also triggers specific behavior: it matches "any contiguous whitespace" and also omits leading whitespace. (The pattern / / matches a single space, turning off this special behavior of ' '.) See a more precise description on the split page. Thus it is advisable to use ' ' when splitting on an unspecified number of spaces, since for split this is idiomatic, is perhaps the most common use, and is its default. Thanks to Borodin for prompting this comment and update (the original post had the nearly equivalent /\s+/, which differs only in producing an empty leading field when the line starts with whitespace).
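The difference between these three patterns can be seen on a line with leading and repeated spaces. This is only an illustrative sketch; the sample line is made up from the question's data with extra spaces added.

```perl
use strict;
use warnings;

# Sample line with two leading spaces and three spaces between the columns.
my $line = "  AAAAAAAAAA1   m081216|101|123";

my @special = split ' ',   $line;  # skips leading whitespace, splits on runs: 2 fields
my @regex   = split /\s+/, $line;  # leading whitespace gives an empty first field: 3 fields
my @single  = split / /,   $line;  # every single space is a separator: 6 fields

printf "%d %d %d\n", scalar @special, scalar @regex, scalar @single;
```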
Note that in this case, since ' ' is the default along with $_, we can shorten it a little:
for (<$in_fh>) {
chomp;
my ($id, $seq) = split;
push @{$seqs{$id}}, $seq;
}
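The question asked for names of the form "AAAAAAAAAA1.3.txt" rather than AA..._count.txt. A minimal sketch of the same grouping with that naming follows; the helper names group_by_id and outfile_name are hypothetical, and the input lines are passed in directly instead of being read from a file so that the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: group tab-separated "id<TAB>seq" lines by id.
sub group_by_id {
    my @lines = @_;
    my %seqs;
    for my $line (@lines) {
        chomp $line;
        my ($id, $seq) = split /\t/, $line;
        next unless defined $seq;       # skip blank or malformed lines
        push @{ $seqs{$id} }, $seq;     # the array is created on the first push
    }
    return \%seqs;
}

# Hypothetical helper: build the "ID.count.txt" name used in the question.
sub outfile_name {
    my ($id, $seqs) = @_;
    return join '.', $id, scalar @{ $seqs->{$id} }, 'txt';
}

my $seqs = group_by_id(
    "AAAAAAAAAA1\tm081216|101|123\n",
    "AAAAAAAAAA1\tm081216|100|1987\n",
    "BBBBBBBBBB2\tm081216|254|260489\n",
);
print outfile_name($_, $seqs), "\n" for sort keys %$seqs;
```

Because the count is taken from the array stored under each ID, the hash keeps the number of second-column values per ID, which was the concern raised in the question.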
Upvotes: 3