Sos

Reputation: 1949

Extract information from lines and columns in Perl

I have a huge file with multiple lines and columns. Each line has many columns and many lines have the same name in the same position. E.g.

 A  C  Z  Y  X
 A  C  E  J
 B  E  K  L  M

What is the best way to find all lines that share the same items in a certain position? For instance, I would like to know that there are 2 A, 2 C, 1 Z, etc., all ordered by column.

I am really new to Perl and am struggling to make progress on this, so any tips are appreciated.

I got to this point:

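#!/usr/local/bin/perl -w

use strict;

my $path = 'My:\Path\To\My\File.txt';
my @line;    # fields of the current line

# Open with a lexical filehandle; die with the OS error on failure
open(my $fh, '<', $path) or die("Error opening $path: $!");
print "Opened!\n";

while (<$fh>)
{
    chomp;
    @line = split(/\t/, $_);    # split the line into tab-separated columns
}

close $fh;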

The output can be another TSV that examines only the columns before the 5th, ordered from top to bottom, like:

 A  2
 C  2
 Z  1
 Y  1
 E  1
 J  1
 B  1
 E  1
 K  1
 L  1

Note that the first items appear first and, when shared among lines, do not show again for subsequent lines.

Edit: as per the questions in the comments, I changed the dataset and output. Note that two E appear: one belonging to the third column, the other belonging to the second column.

Edit2: Alternatively, this could be analyzed column by column, showing the results for the first column, then for the second, and so on, as long as the columns are clearly separated. Something like:

 "1st" "col"
 A 2
 B 1
 "2nd" "col"
 C 2
 E 1
 "3rd" "col"
 Z 1
 E 1
 K 1
 "4th" "col"
 Y 1
 J 1
 L 1

Upvotes: 0

Views: 138

Answers (2)

Kenosis

Reputation: 6204

Perhaps the following will be helpful:

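use strict;
use warnings;

my $path = 'My:\Path\To\My\File.txt';
my %hash;    # $hash{column index}{value} = count

open my $fh, '<', $path or die $!;

while (<$fh>) {
    # Split on whitespace into at most 5 fields; only the first 4 are counted
    my @cols = split ' ', $_, 5;
    # Use // (defined-or) so that a literal "0" is still counted as a value
    $hash{$_}{ $cols[$_] // '' }++ for 0 .. 3;
}

close $fh;

# Print each column's values, most frequent first, skipping empty cells
for my $key ( sort { $a <=> $b } keys %hash ) {
    print "Col ", $key + 1, "\n";
    print "$_ $hash{$key}{$_}\n"
      for sort { $hash{$key}->{$b} <=> $hash{$key}->{$a} }
      grep { length } keys %{ $hash{$key} };
}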

Output on your dataset (values with equal counts may appear in any order within a column):

Col 1
A 2
B 1
Col 2
C 2
E 1
Col 3
Z 1
K 1
E 1
Col 4
J 1
L 1
Y 1
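
If the flat, first-appearance ordering from the question is what is needed, a variation along the following lines might work. This is only a sketch: the function name count_in_first_seen_order and the in-memory sample are made up for illustration, and it assumes whitespace-separated columns and that only the first four columns matter.

```perl
use strict;
use warnings;

# Count values per column and remember the order in which each
# (column, value) pair is first seen, reading row by row.
# Hypothetical helper; not part of any module.
sub count_in_first_seen_order {
    my ( $fh, $max_cols ) = @_;
    my ( %count, @order );
    while ( my $line = <$fh> ) {
        my @cols = split ' ', $line;
        for my $i ( 0 .. $#cols ) {
            last if $i >= $max_cols;
            my $val = $cols[$i];
            # Record the pair only the first time it appears
            push @order, [ $i, $val ] unless $count{$i}{$val}++;
        }
    }
    # Return [value, final count] pairs in first-seen order
    return map { [ $_->[1], $count{ $_->[0] }{ $_->[1] } ] } @order;
}

# Sample rows from the question, read through an in-memory filehandle
my $sample = "A C Z Y X\nA C E J\nB E K L M\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @rows = count_in_first_seen_order( $fh, 4 );
close $fh;

# One tab-separated "value count" pair per line
print join( "\t", @$_ ), "\n" for @rows;
```

On the question's three sample rows this lists the pairs in the order shown in the question, including E twice (once for the third column, once for the second). To read the real file, open $path instead of the in-memory string.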

Upvotes: 0

Miller

Reputation: 35198

I did not fully understand the formatting of your desired output, so the script below prints all the data from the first column on the first row, the second column on the second row, and so on. This can easily be modified to the format that you desire, but it is a quick starting point showing how to accumulate the data first and then process it.

use strict; 
use warnings;
use autodie;

my $path='My:\Path\To\My\File.txt';

open my $fh, '<', $path;

my @data;

# while (<$fh>) { Switch these lines when ready for real data
while (<DATA>) {
    my @row = split ' ';
    for my $col (0..$#row) {
        $data[$col]{$row[$col]}++;
    }
}

for my $coldata (@data) {
    for my $letter (sort keys %$coldata) {
        print "$letter $coldata->{$letter} ";
    }
    print "\n";
}

close $fh;

__DATA__
A  C  Z  Y  X
A  C  D  J
B  E  K  L  M

Outputs

A 2 B 1
C 2 E 1
D 1 K 1 Z 1
J 1 L 1 Y 1
M 1 X 1
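
As one possible way of adapting the accumulated data to the column-by-column layout from the question's second edit, the printing step could be replaced by something like the sketch below. The helper names ordinal and column_report are hypothetical, and the ordering inside a column (highest count first, ties alphabetical) is an assumption, since the question does not pin it down.

```perl
use strict;
use warnings;

# English ordinal for a positive integer (1st, 2nd, 3rd, 4th, 11th, ...)
sub ordinal {
    my ($n) = @_;
    return "${n}th" if $n % 100 >= 11 && $n % 100 <= 13;
    my %suffix = ( 1 => 'st', 2 => 'nd', 3 => 'rd' );
    return $n . ( $suffix{ $n % 10 } // 'th' );
}

# Turn the array of per-column count hashes into report lines
sub column_report {
    my (@data) = @_;
    my @lines;
    for my $i ( 0 .. $#data ) {
        my $counts = $data[$i] or next;    # skip columns with no data
        push @lines, sprintf( '"%s" "col"', ordinal( $i + 1 ) );
        # Highest count first; ties broken alphabetically (an assumption)
        for my $item ( sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
            keys %$counts )
        {
            push @lines, "$item $counts->{$item}";
        }
    }
    return @lines;
}

# Same accumulation as in the script above, on the same sample rows
my @data;
for my $line ( "A C Z Y X", "A C D J", "B E K L M" ) {
    my @row = split ' ', $line;
    $data[$_]{ $row[$_] }++ for 0 .. $#row;
}

print "$_\n" for column_report(@data);
```

Each column gets a header line such as "1st" "col" followed by one "value count" line per distinct value. To stop at the fourth column, pass only a slice, e.g. column_report(@data[0 .. 3]).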

Upvotes: 1
