Sos

Reputation: 1949

Perl: how to extract all words matching a pattern and build processed lists

So I have a file "Myoutput test.txt" of the form

#Some comments
#some more comments
A X word123_0988b 0.00132 -123.4 567
T E word123_0988b 0.00456 -231.4 897
H D word123_0988b 1.3132 -120.2 757
F Y word234_09876b 0.1231 -12344 789
A T word234_09876b 0.34531 -144 789
F Y word234_09876b 0.1231 -12344 789
G L word890_0987a 0.00012 -12312 654

And I want to build a list of the form

{{word123_0988b,A,T,H},{word234_09876b,F,A,F},{word890_0987a,G}}

where the first position of each sublist is the identifier in the 3rd column, and the other letters are all of the letters in the first column which this identifier is associated with.

My plan was:

  1. Extract all identifiers in the 3rd column, delete duplicates;
  2. For each identifier, select all lines with this identifier, extract the elements in column 1, and push them into an array of the form {identifier,1stcol1,2ndcol1,3rdcol1}.

However, I can't even get the first step to work. Here's what I have so far:

#!/usr/local/bin/perl
use strict;
use warnings;

my $dir='D:\test';
my ($out,$file);

open $out,"<", "$dir\\Myoutput test.txt" or die "problem opening out $!";

my @file = grep (!/^#/,<$out>); #ignores commented lines

while ($file =~ /(\w*word\w*)/g){
    print "$1\n"; #would print all words matching "word"
}

close $out;

Could someone give me some tips or any guidance on how to do this? Thank you so much!

Upvotes: 1

Views: 486

Answers (2)

Kenosis

Reputation: 6204

When you write:

my @file = grep (!/^#/,<$out>); 

you're forcing Perl to build a complete list of the file's lines in memory, just to skip those that begin with #. Typically, this is handled in a while loop, so that only one line at a time is read from the file and unwanted lines are skipped as they come.
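As a minimal sketch of that line-at-a-time pattern (the sample data and the @kept array are only stand-ins for illustration, and an in-memory filehandle replaces the real file so the snippet runs on its own):

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real file; an in-memory
# filehandle keeps the sketch self-contained.
my $sample = <<'END';
#Some comments
A X word123_0988b 0.00132 -123.4 567
T E word123_0988b 0.00456 -231.4 897
END

open my $in, '<', \$sample or die "Cannot open sample data: $!";

my @kept;
while ( my $line = <$in> ) {
    next if $line =~ /^#/;    # skip commented lines as they are read
    chomp $line;
    push @kept, $line;        # stand-in for whatever per-line work is needed
}
close $in;

print scalar(@kept), " data lines kept\n";
```

With a real file you would pass its path to open instead of \$sample; the loop body stays the same.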

The data structure that would help here is a hash of arrays (HoA), where the keys are the identifiers and the values are references to arrays of column 1 letters. Here's how this can be done:

use strict;
use warnings;

my %hash;
local $" = ',';

while (<DATA>) {
    next if /^#/;
    my @cols = split ' ', $_, 4;
    push @{ $hash{ $cols[2] } }, $cols[0];
}

print '{';
print "{$_,@{ $hash{$_} }}" for sort keys %hash;
print '}';

__END__
#Some comments
#some more comments
A X word123_0988b 0.00132 -123.4 567
T E word123_0988b 0.00456 -231.4 897
H D word123_0988b 1.3132 -120.2 757
F Y word234_09876b 0.1231 -12344 789
A T word234_09876b 0.34531 -144 789
F Y word234_09876b 0.1231 -12344 789
G L word890_0987a 0.00012 -12312 654

Output:

{{word123_0988b,A,T,H}{word234_09876b,F,A,F}{word890_0987a,G}}

Setting local $" = ','; makes Perl put a comma between array elements when the array is interpolated into a double-quoted string. Each line is split with split's LIMIT set to 4, since only the first three columns matter: the first three fields are split off and the rest of the line is left unsplit in the fourth. The push line builds the HoA, with Perl autovivifying each array reference the first time an identifier is seen. Finally, the HoA is printed with its keys sorted.
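Note that the output above has no commas between the sublists, while the format asked for in the question does. If that matters, one possible variant (a sketch only; the names %letters_for, @groups and $result are made up here, and the data is hard-coded instead of read from a file) builds each sublist with join and then joins the sublists themselves:

```perl
use strict;
use warnings;

# Sample lines copied from the question, comments already removed.
my @lines = (
    'A X word123_0988b 0.00132 -123.4 567',
    'T E word123_0988b 0.00456 -231.4 897',
    'H D word123_0988b 1.3132 -120.2 757',
    'F Y word234_09876b 0.1231 -12344 789',
    'A T word234_09876b 0.34531 -144 789',
    'F Y word234_09876b 0.1231 -12344 789',
    'G L word890_0987a 0.00012 -12312 654',
);

my %letters_for;    # identifier => array ref of column 1 letters
for my $line (@lines) {
    my ( $letter, undef, $id ) = split ' ', $line, 4;
    push @{ $letters_for{$id} }, $letter;    # autovivifies the array ref
}

# Build one "{id,letter,...}" string per identifier, then join them.
my @groups = map { '{' . join( ',', $_, @{ $letters_for{$_} } ) . '}' }
             sort keys %letters_for;
my $result = '{' . join( ',', @groups ) . '}';

print "$result\n";
```

This should print {{word123_0988b,A,T,H},{word234_09876b,F,A,F},{word890_0987a,G}}, and it avoids changing $" at all.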

Hope this helps!

Upvotes: 3

hmatt1

Reputation: 5139

The problem is that you never iterate over your array @file. Your while loop tests the scalar $file, which is a different variable and is never assigned. Because you declared $file alongside $out, strict doesn't raise an error. You'll want to loop over the array with a for loop instead. Try something like this:

#!/usr/local/bin/perl
use strict;
use warnings;

my $out;

open $out,"<", "test.txt" or die "problem opening out $!";

my @file = grep (!/^#/,<$out>); #ignores commented lines

for my $file (@file) {
        if ( $file =~ /(\w*word\w*)/ ) { # one identifier per line, so no /g needed
                print "$1\n"; #would print all words matching "word"
        }
}
close $out;

I changed the path in the open statement, so you'll have to change it back to your input file. Hopefully this gets you past the first step. The output looks like:

matt@mattpc:~/Documents/test/4$ perl test.pl 
word123_0988b
word123_0988b
word123_0988b
word234_09876b
word234_09876b
word234_09876b
word890_0987a
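That still prints each identifier once per line. To also drop the duplicates, as in step 1 of the question, a common idiom is a %seen hash. A rough sketch, assuming the same pattern and using hard-coded sample lines instead of the file (the names %seen and @ids are just for illustration):

```perl
use strict;
use warnings;

# Sample lines copied from the question, comments already removed.
my @lines = (
    'A X word123_0988b 0.00132 -123.4 567',
    'T E word123_0988b 0.00456 -231.4 897',
    'H D word123_0988b 1.3132 -120.2 757',
    'F Y word234_09876b 0.1231 -12344 789',
    'A T word234_09876b 0.34531 -144 789',
    'F Y word234_09876b 0.1231 -12344 789',
    'G L word890_0987a 0.00012 -12312 654',
);

my %seen;    # identifier => number of times encountered so far
my @ids;     # unique identifiers, in order of first appearance
for my $line (@lines) {
    next unless $line =~ /(\w*word\w*)/;
    push @ids, $1 unless $seen{$1}++;    # keep only the first occurrence
}

print "$_\n" for @ids;
```

Each identifier is then printed once, in the order it first appears in the input.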

Upvotes: 3
