Reputation: 77
I have two types of tab-separated input files. The first is a matrix with names listed vertically in the first column and numerical values in the subsequent columns. The second contains a single column holding a subset of the names found in the first column of the first file type.
EX: input1
Gary 1 2 3
Yolanda 3 4 5
Biff 5 6 7
Hubert 8 9 10
EX: input2
Gary
Biff
While there are several different variations on input2, there is only a single input1. I have a Perl script with an embedded awk command that is supposed to match names from input2 against input1 and write an output file containing the names from input2 and their respective values from input1.
EX: outputfile
Gary 1 2 3
Biff 5 6 7
Here is my code:
#!/usr/bin/perl
use strict;
use warnings;
my $dir1 = '../FeatureSelection/Chunks/ArffPreprocessing';
my $dir2 = '../DataFiles';
opendir(DIR, $dir1) or die $!;
while (my $file = readdir(DIR)) {
    # We only want files
    next unless (-f "$dir1/$file");
    # Use a regular expression to find files with .txt
    next unless ($file =~ m/\.txt/);
    my @partialName = (split /\./, $file);
    # $matchingFile is the file that contains attributes listed vertically, alongside their respective data
    my $matchingFile = "$dir2/input1\.txt ";
    system("awk -F\"\t\" 'FILENAME==\"$dir1/$file\"{a[\$1]=\$1} FILENAME==\"$matchingFile\"{if(a[\$1]){print \$0}}' $dir1/$file $matchingFile > $dir1/$partialName[0]'\_matched.out' ");
}
closedir(DIR);
exit 0;
This is the line that works on the command line, but it refuses to work in my Perl script:
awk -F"\t" 'FILENAME=="input2.txt"{a[$1]=$1} FILENAME=="../../../DataFiles/input1.txt"{if(a[$1]){print $0}}' input2.txt ../../../DataFiles/input1.txt > input2_matched.out
By the way, the sheer number of input2 files makes hard-coding the above awk line at the command prompt a real pain in the butt, which is why I am using a Perl script that can perform the desired function on every input2 file in the directory AND keep the naming convention for the output files. I've written similar programs, so I know the syntax of
system("awk ...blah blah... ");
can and does work properly.
I've been stuck on this problem for days now, so any help would be most appreciated!
Upvotes: 3
Views: 782
Reputation: 21955
I would suggest find plus a comparison function to achieve your objective:
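matcher(){
    # First file (input1): store each record, keyed by the name in column 1.
    # Remaining files (input2 variants): print the stored input1 record for each listed name.
    awk 'NR==FNR{input1record[$1]=$0;next}
    $1 in input1record{print input1record[$1]}' /path/to/input1 "$@" >> /path/to/result
}
export -f matcher
find /path/to/input2_files -type f -name "input2" \
    -exec bash -c 'matcher "$@"' _ {} +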
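As a quick sanity check of the awk logic, the sketch below runs the same NR==FNR program on the sample data from the question. The scratch directory and the file names input1.txt and input2.txt are assumptions made for this demo only; the real paths will differ.

```shell
#!/usr/bin/env bash
# Demo only: recreate the question's sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'Gary\t1\t2\t3\nYolanda\t3\t4\t5\nBiff\t5\t6\t7\nHubert\t8\t9\t10\n' > "$tmp/input1.txt"
printf 'Gary\nBiff\n' > "$tmp/input2.txt"

# While reading the first file (NR==FNR), store each input1 row under its name.
# While reading the second file, print the stored row for every listed name.
matched=$(awk -F'\t' 'NR==FNR { row[$1] = $0; next }
                      $1 in row { print row[$1] }' "$tmp/input1.txt" "$tmp/input2.txt")

printf '%s\n' "$matched"
rm -rf "$tmp"
```

With the sample files, this should print the Gary and Biff rows in the order they appear in input2.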
References
The {} + form of find's -exec builds the command line from the matched files and runs the subshell command (our function, in this case) with as many of them per invocation as possible, rather than once per file. See the find manpage.
Note that I have used >> to append the output of subsequent runs to the output file. If this is not desired, use > instead.
The pattern given to -name should be adjusted to match all the input2 filenames.
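If each input2 file should get its own output file, following the question's naming convention (<name>_matched.out), a plain shell loop is one alternative to appending everything to a single result. This is a sketch under assumptions: the function name match_all, its two arguments, and the .txt suffix of the input2 files are illustrative choices, not details confirmed by the posts above.

```shell
#!/usr/bin/env bash
# match_all INPUT1 DIR  (hypothetical helper)
# Writes DIR/<name>_matched.out for every DIR/*.txt file.
match_all() {
    local input1=$1 dir=$2 f base
    for f in "$dir"/*.txt; do
        [ -f "$f" ] || continue     # skip when the glob matched nothing
        base=${f##*/}               # drop the directory part
        base=${base%%.*}            # keep the text before the first dot, like split /\./ in the Perl script
        # Load the input1 rows by name, then print the row for each name listed in $f.
        awk -F'\t' 'NR==FNR { row[$1] = $0; next }
                    $1 in row { print row[$1] }' "$input1" "$f" > "$dir/${base}_matched.out"
    done
}

# Example call, using the directories named in the question:
# match_all ../DataFiles/input1.txt ../FeatureSelection/Chunks/ArffPreprocessing
```

Because the NR==FNR test tells the two files apart by position rather than by name, this form does not depend on any FILENAME string comparison. That matters for the Perl script as posted: $matchingFile is assigned "$dir2/input1\.txt " with a trailing space, so the FILENAME== test built by the script compares against a name ending in a space, which cannot equal the file name awk actually receives.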
Upvotes: 0