Reputation: 13062
I have several data files that are tab delimited. I need to extract all the unique values in a certain column of these data files (say column 25) and write these values into an output file for further processing. How might I do this in Perl? Remember I need to consider multiple files in the same folder.
edit: The code I've written so far looks like this.
#!/usr/bin/perl
use warnings;
use strict;
my @hhfilelist = glob "*.hh3";
for my $f (@hhfilelist) {
    open F, '<', $f or die "Cannot open $f: $!";   # 'or' (not '||') so the die actually fires on failure
    while (<F>) {
        chomp;
        my @line = split /\t/;
        print "field is $line[24]\n";
    }
    close F;
}
The question is: how do I efficiently build the hash/array of unique values as I read each line of each file? Or would it be faster to populate the whole array and then remove duplicates?
Upvotes: 1
Views: 3378
Reputation: 118605
perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' inputs > output

or, with your specific file pattern:

perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' *.hh3 > output
The command-line switches -F/\\t/ -an mean: iterate through every line of every input file and split each line on the tab character into the array @F.
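If you want to see exactly what those switches expand to, B::Deparse can print the implicit loop they generate (just a sanity check, not part of the solution). It prints something like the following, with details varying by Perl version:

perl -MO=Deparse -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++'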
$F[24] refers to the value in the 25th field of each line (between the 24th and 25th tab characters).
$seen{...} is a hash table that keeps track of which values have already been observed. The first time a value is observed, $seen{VALUE} is 0, so Perl executes the statement print"$F[24]\n". Every subsequent time that value is observed, $seen{VALUE} is non-zero and the statement is not executed. This way each unique value gets printed exactly once.
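A tiny standalone illustration of that %seen idiom (the sample values are arbitrary):

my %seen;
for my $v (qw(a b a c b a)) {
    print "$v\n" unless $seen{$v}++;   # prints a, b, c -- duplicates are skipped
}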
In a similar context to your larger script:
my @hhfilelist = glob "*.hh3";
my %values_in_field_25 = ();
for my $f (@hhfilelist) {
    open F, '<', $f or die "Cannot open $f: $!";
    while (<F>) {
        chomp;   # avoid a trailing newline sticking to the last field
        my @F = split /\t/;
        $values_in_field_25{$F[24]} = 1;
    }
    close F;
}
my @unique_values_in_field_25 = keys %values_in_field_25; # or sort keys ...
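To cover the "write these values into an output file" part of the question, a possible final step (the output filename here is just an example):

open my $out, '>', 'unique_col25.txt' or die "Cannot open unique_col25.txt: $!";
print $out "$_\n" for sort keys %values_in_field_25;
close $out;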
Upvotes: 2
Reputation: 129393
For a Perl solution, use the Text::CSV module to parse flat (X-separated) files; the constructor accepts a parameter specifying the separator character. Do this for every file in a loop, with the file list generated either by glob() for files in a given directory or by File::Find if you need to cover subdirectories as well.
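For the File::Find route, a minimal sketch that collects matching files recursively (the .hh3 extension is carried over from the question; adjust as needed):

use File::Find;

my @files;
# find() calls the sub for every entry under '.', recursively;
# $File::Find::name holds the full path of the current entry
find(sub { push @files, $File::Find::name if -f && /\.hh3$/ }, '.');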
Then, to get the unique values, store column #25 of each row in a hash. E.g., after retrieving the values:
$colref = $csv->getline($io);
$unique_values_hash{ $colref->[24] } = 1;   # column 25 is index 24
Then, iterate over hash keys and print to a file.
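Putting those pieces together, a minimal sketch of the whole approach (the file pattern and output filename are assumptions carried over from the question):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# sep_char => "\t" tells Text::CSV to treat the files as tab-separated
my $csv = Text::CSV->new({ sep_char => "\t", binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

my %unique_values_hash;
for my $file (glob '*.hh3') {
    open my $io, '<', $file or die "Cannot open $file: $!";
    while (my $colref = $csv->getline($io)) {
        $unique_values_hash{ $colref->[24] } = 1;   # column 25 is index 24
    }
    close $io;
}

open my $out, '>', 'unique_values.txt' or die "Cannot open unique_values.txt: $!";
print $out "$_\n" for sort keys %unique_values_hash;
close $out;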
For a non-Perl shell solution, you can simply do:
cat MyFile_pattern | awk -F'\t' '{print $25}' | sort -u > MyUniqueValuesFile
You can replace awk with cut (e.g. cut -f25, since cut splits on tabs by default).
Please note that the non-Perl solution only works if the files don't contain tabs within the fields themselves and the columns aren't quoted.
Upvotes: 3
Reputation: 74222
Some tips on how to handle the problem:
- Get the list of files with glob (e.g. glob '.* *') or with File::Find's find function.
- Parse each file with Text::CSV, using the \t character as the delimiter, extract the wanted values, and write them to a file.
Upvotes: 3