Rob

Reputation: 175

Parsing file based on column ID: perl

I have a tab-delimited file with repeated values in the first column. Each repeated value in the first column corresponds to multiple values in the second column. It looks something like this:

    AAAAAAAAAA1     m081216|101|123
    AAAAAAAAAA1     m081216|100|1987
    AAAAAAAAAA1     m081216|927|463729
    BBBBBBBBBB2     m081216|254|260489
    BBBBBBBBBB2     m081216|475|1234
    BBBBBBBBBB2     m081216|987|240
    CCCCCCCCCC3     m081216|433|1000
    CCCCCCCCCC3     m081216|902|366 
    CCCCCCCCCC3     m081216|724|193 

For each distinct sequence in the first column, I am trying to write a file containing just the sequences that correspond to it. The file name should include the repeated sequence from the first column and the number of sequences that correspond to it in the second column. In the above example I would therefore have 3 files of 3 sequences each. The first file would be named something like "AAAAAAAAAA1.3.txt" and look like the following when opened:

    m081216|101|123
    m081216|100|1987
    m081216|927|463729

I have seen other similar questions, but they were answered using a hash. I don't think I can use a hash because I need to keep the number of relationships between columns. Maybe there is a way to use a hash of hashes? I am not sure. Here is my code so far:

    use warnings;
    use strict;
    use List::MoreUtils 'true';

    open(IN, "<", "/path/to/in_file") or die $!;

    my @array;
    my $queryID;

    while(<IN>){
            chomp;
            my $OutputLine = $_;
            processOutputLine($OutputLine);
    }


    sub processOutputLine {
            my ($OutputLine) = @_;
            my @Columns = split("\t", $OutputLine);
            my ($queryID, $target) = @Columns;
            push(@array, $target, "\n") unless grep{$queryID eq $_} @array;
            my $delineator = "\n";
            my $count = true { /$delineator/g } @array;
            open(OUT, ">", "/path/to/out_$..$queryID.$count.txt") or die $!;
            foreach(@array){
                    print OUT @array;
            }
     }

Upvotes: 0

Views: 88

Answers (1)

zdim

Reputation: 66883

I would still recommend a hash. Store all sequences related to the same ID in an anonymous array, which is the value for that ID's key; the size of that array then gives the count you need for the file name. The core of it is really two lines of code.

    use warnings;
    use strict;
    use feature qw(say);

    my $filename = 'rep_seqs.txt';   # input file name
    open my $in_fh, '<', $filename or die "Can't open $filename: $!";

    # Group the sequences by ID: each hash value is an array reference
    my %seqs;
    while (my $line = <$in_fh>) {
        chomp $line;
        my ($id, $seq) = split /\t/, $line;
        push @{$seqs{$id}}, $seq;
    }
    close $in_fh;

    # Write one file per ID, named ID_count.txt
    my $out_fh;
    for my $id (sort keys %seqs) {
        my $outfile = $id . '_' . scalar(@{$seqs{$id}}) . '.txt';
        open $out_fh, '>', $outfile or do {
            warn "Can't open $outfile: $!";
            next;
        };
        say $out_fh $_ for @{$seqs{$id}};
    }
    close $out_fh;

With your input I get the desired files, named AA..._count.txt, each with its corresponding three lines. If the items separated by | also need to be split, you can do that while writing them out.
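As a rough sketch of that last point, the fields could be split on the pipe at output time. The sample data, the `@rows` array, and the choice to re-join the fields with tabs are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical sample: one ID with its sequences, as it would sit in %seqs
my %seqs = (
    AAAAAAAAAA1 => [ 'm081216|101|123', 'm081216|100|1987' ],
);

my @rows;
for my $id (sort keys %seqs) {
    for my $seq (@{ $seqs{$id} }) {
        # The pipe is a regex metacharacter, so it has to be escaped
        my @fields = split /\|/, $seq;
        push @rows, join "\t", @fields;   # assumed output layout: tab-separated
    }
}
print "$_\n" for @rows;
```

In the real script the `print` would go to the per-ID output filehandle rather than to standard output.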

Comments

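  • The anonymous array for a key, $seqs{$id}, is created automatically (autovivified) on the first push if it does not already exist

  • If the tabs cause problems (for example, if they were converted to spaces), split on ' ' instead of /\t/. See the note below.

  • Calling open on a filehandle that is already open closes it first, so there is no need to close it explicitly on every iteration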
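The first note can be illustrated with a minimal sketch. The IDs and values are taken from the question's sample, and everything else is only an illustration of how pushing onto a not-yet-existing key behaves.

```perl
use strict;
use warnings;

my %seqs;

# No explicit initialisation: the first push creates the array reference
push @{ $seqs{AAAAAAAAAA1} }, 'm081216|101|123';
push @{ $seqs{AAAAAAAAAA1} }, 'm081216|100|1987';
push @{ $seqs{BBBBBBBBBB2} }, 'm081216|254|260489';

for my $id (sort keys %seqs) {
    # An array in scalar context evaluates to its number of elements
    my $count = scalar @{ $seqs{$id} };
    print "$id has $count sequence(s)\n";
}
```

This is also why the count needed for the file name comes for free: it is simply the size of the array stored under each ID.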


The default pattern for split is ' ', which triggers special behavior: it matches any contiguous whitespace and also discards leading whitespace. (The pattern / / matches a single space, turning off this special behavior of ' '.) See the split documentation for a more precise description. It is therefore advisable to use ' ' when splitting on an unspecified number of spaces: it is idiomatic, probably the most common use of split, and its default. Thanks to Borodin for prompting this comment and update (the original post used the nearly equivalent /\s+/, which differs in that leading whitespace produces an empty first field).

Note that in this case, since ' ' is the default pattern and $_ is the default string, we can shorten the loop a little:

    while (<$in_fh>) {
        chomp;
        my ($id, $seq) = split;
        push @{$seqs{$id}}, $seq;
    }
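The difference between the split patterns discussed above can be seen in a small sketch. The input line, with its leading and repeated spaces, is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical line: two leading spaces, three spaces between the columns
my $line = "  AAAAAAAAAA1   m081216|101|123";

my @awk    = split ' ',   $line;   # special case: leading whitespace is skipped
my @single = split / /,   $line;   # every single space is its own separator
my @regex  = split /\s+/, $line;   # leading whitespace yields an empty first field

print "' '   gives ", scalar(@awk),    " fields\n";
print "/ /   gives ", scalar(@single), " fields\n";
print "/\\s+/ gives ", scalar(@regex),  " fields\n";
```

Only the ' ' form returns exactly the ID and the sequence here, which is why it is the safer choice when the separator may have been converted from tabs to an unknown number of spaces.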

Upvotes: 3
