sfactor

Reputation: 13062

Extracting unique values from multiple files in Perl

I have several data files that are tab delimited. I need to extract all the unique values in a certain column of these data files (say column 25) and write these values into an output file for further processing. How might I do this in Perl? Remember I need to consider multiple files in the same folder.

edit: The code I've done thus far is like this.

#!/usr/bin/perl                   

use warnings;
use strict;

my @hhfilelist  = glob "*.hh3";

for my $f (@hhfilelist) {
  open my $fh, '<', $f or die "Cannot open $f: $!";
  while (<$fh>) {
    chomp;
    my @line = split /\t/;

    print "field is $line[24]\n";
  }
  close $fh;
}

The question is how do I efficiently create the hash/array of unique values as I read each line of each file. Or is it faster if I populate the whole array and then remove duplicates?

Upvotes: 1

Views: 3378

Answers (3)

mob

Reputation: 118605

In general:

perl -F/\\t/ -ane 'print "$F[24]\n" unless $seen{$F[24]}++' inputs > output

or, for your files:

perl -F/\\t/ -ane 'print "$F[24]\n" unless $seen{$F[24]}++' *.hh3 > output

The command-line switches -F/\\t/ -an tell Perl to iterate through every line of every input file and split each line on the tab character into the array @F.

$F[24] refers to the value in the 25th field of each line (between the 24th and 25th tab characters).

$seen{...} is a hash that keeps track of which values have already been observed. The first time a value is observed, $seen{VALUE} is undefined (false), so Perl executes the statement print "$F[24]\n", and the post-increment then makes the entry true. Every subsequent time the value is observed, $seen{VALUE} is non-zero and the statement is skipped. This way each unique value gets printed out exactly once.


In a similar context to your larger script:

my @hhfilelist  = glob "*.hh3";
my %values_in_field_25 = ();
for my $f (@hhfilelist) {
  open my $fh, '<', $f or die "Cannot open $f: $!";
  while (<$fh>) {
    chomp;
    my @F = split /\t/;
    $values_in_field_25{$F[24]} = 1;
  }
  close $fh;
}

my @unique_values_in_field_25 = keys %values_in_field_25; # or sort keys ...

Upvotes: 2

DVK

Reputation: 129393

For a Perl solution, use the Text::CSV module to parse flat (X-separated) files; the constructor accepts a parameter specifying the separator character. Do this for every file in a loop, with the file list generated either by glob() for files in a given directory or by File::Find if you need to cover subdirectories as well.

Then, to get the unique values, store column #25 of each row as a hash key.

E.g. after retrieving the values:

 $colref = $csv->getline($io);
 $unique_values_hash{ $colref->[24] } = 1;

Then, iterate over hash keys and print to a file.


For non-Perl shell solution, you can simply do:

cat MyFile_pattern | awk -F'\t' '{ print $25 }' | sort -u > MyUniqueValuesFile

You can replace awk with cut (cut -f25; tab is cut's default delimiter)

Please note that the non-Perl solution only works if the fields themselves contain no tabs and the columns aren't quoted.
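A quick check of the pipeline on inline sample data (the rows are made up; column 2 stands in for column 25):

```shell
# Two distinct values in column 2; sort -u emits each exactly once, sorted.
printf 'a\tx\nb\tx\na\ty\n' | awk -F'\t' '{ print $2 }' | sort -u
```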

Upvotes: 3

Alan Haggai Alavi

Reputation: 74222

Some tips on how to handle the problem:

  • Find files
    • For finding files within a directory, use glob: glob '.* *'
    • For finding files within a directory tree, use File::Find's find function
  • Open each file, use Text::CSV with the tab character (\t) as the delimiter, extract the wanted values and write them to an output file

Upvotes: 3
