Perl how to extract all words matching a pattern and build processed lists

Question

So I have a file "Myoutput test.txt" of the form

#Some comments
#some more comments
A X word123_0988b 0.00132 -123.4 567
T E word123_0988b 0.00456 -231.4 897
H D word123_0988b 1.3132 -120.2 757
F Y word234_09876b 0.1231 -12344 789
A T word234_09876b 0.34531 -144 789
F Y word234_09876b 0.1231 -12344 789
G L word890_0987a 0.00012 -12312 654

And I want to build a list of the form

{{word123_0988b,A,T,H},{word234_09876b,F,A,F},{word890_0987a,G}}

where the first position of each sublist is the identifier in the 3rd column, and the other letters are all of the letters in the first column which this identifier is associated with.

To do this I was thinking about doing this:

Extract all identifiers in the 3rd column, delete duplicates;
For each identifier, select all lines with this identifier, extract the elements in column 1, push these into an array of the form {identifier,1stcol1,2ndcol1,3rdcol1}.

However, I can't even get the first step working. Here's what I have so far:

#!/usr/local/bin/perl
use strict;
use warnings;

my $dir='D:\test';
my ($out,$file);

open $out,"<", "$dir\\Myoutput test.txt" or die "problem opening out $!";

my @file = grep (!/^#/,<$out>); #ignores commented lines

while ($file =~ /(\w*word\w*)/g){
    print "$1\n"; #would print all words matching "word"
}

close $out;

Could someone give me some tips or any guidance on how to do this? Thank you so much!

Kenosis · Accepted Answer

When you:

my @file = grep (!/^#/,<$out>);

you're forcing the creation of a complete list of the file's lines, just to skip those which begin with #. Typically, this is handled in a while loop, so only one line at a time is read from the file, and skipped if not wanted.
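
As a minimal sketch of that line-at-a-time pattern, the following collects the unique column 3 identifiers (the first step described in the question) in the order they first appear. The sub name unique_identifiers and the in-memory sample data are made up for illustration and are not part of the original post; with the real file you would pass the filehandle returned by your own open call.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read a filehandle
# one line at a time and return the unique column-3 identifiers in
# the order they first appear.
sub unique_identifiers {
    my ($fh) = @_;
    my ( %seen, @ids );
    while ( my $line = <$fh> ) {
        next if $line =~ /^#/;          # skip comment lines
        my @cols = split ' ', $line;    # split on runs of whitespace
        next if @cols < 3;              # skip blank or short lines
        push @ids, $cols[2] unless $seen{ $cols[2] }++;
    }
    return @ids;
}

# An in-memory filehandle stands in for the real file here.
my $sample = <<'END_SAMPLE';
#Some comments
A X word123_0988b 0.00132 -123.4 567
T E word123_0988b 0.00456 -231.4 897
G L word890_0987a 0.00012 -12312 654
END_SAMPLE

open my $fh, '<', \$sample or die "Cannot open in-memory data: $!";
print join( ',', unique_identifiers($fh) ), "\n";
# prints word123_0988b,word890_0987a
close $fh;
```

Only one line is held in memory at a time; the %seen hash is what removes the duplicates.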

The data structure that would help here is a hash of arrays (HoA), where the keys are the identifiers and the values are references to lists of column 1 letters. Here's how this can be done:

use strict;
use warnings;

my %hash;
local $" = ',';

while (<DATA>) {
    next if /^#/;
    my @cols = split ' ', $_, 4;
    push @{ $hash{ $cols[2] } }, $cols[0];
}

print '{';
print "{$_,@{ $hash{$_} }}" for sort keys %hash;
print '}';

__END__
#Some comments
#some more comments
A X word123_0988b 0.00132 -123.4 567
T E word123_0988b 0.00456 -231.4 897
H D word123_0988b 1.3132 -120.2 757
F Y word234_09876b 0.1231 -12344 789
A T word234_09876b 0.34531 -144 789
F Y word234_09876b 0.1231 -12344 789
G L word890_0987a 0.00012 -12312 654

Output:

{{word123_0988b,A,T,H}{word234_09876b,F,A,F}{word890_0987a,G}}

The local $" = ','; statement sets the list separator, so a comma is printed between array elements when an array is interpolated in a double-quoted string. Each line is split with split's LIMIT set to 4: only the first three columns matter, so the rest of the line is left unsplit as a fourth field. The push line builds the HoA, creating the array reference automatically the first time an identifier is seen. Finally, the HoA is printed.
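
Note that the output has no comma between the sublists, whereas the format asked for in the question does. One possible way to get the commas, sketched here with a hypothetical format_hoa sub that is not part of the original answer, is to build each sublist with join and then join the sublists themselves:

```perl
use strict;
use warnings;

# Hypothetical helper: render a hash of arrays as
# {{key,v1,v2},{key,v1}} with the keys in sorted order.
sub format_hoa {
    my ($hoa) = @_;
    my @groups;
    for my $id ( sort keys %$hoa ) {
        # One sublist: the identifier followed by its column-1 letters.
        push @groups, '{' . join( ',', $id, @{ $hoa->{$id} } ) . '}';
    }
    return '{' . join( ',', @groups ) . '}';
}

# Same data the HoA holds after reading the sample file.
my %hash = (
    word123_0988b  => [qw(A T H)],
    word234_09876b => [qw(F A F)],
    word890_0987a  => [qw(G)],
);

print format_hoa( \%hash ), "\n";
# prints {{word123_0988b,A,T,H},{word234_09876b,F,A,F},{word890_0987a,G}}
```

Using join here also avoids having to change $" at all.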

Hope this helps!
