Reputation: 5188
I have several files, each containing a single column, and I want to compare them to one another to find which elements are contained in all of the files. Alternatively - if it is easier - I could make a column matrix.
How can I find the common elements across multiple columns?
I am not an expert at awk (obviously). So a verbose explanation of the code would be much appreciated.
@joepvd made some code that was somewhat similar... https://unix.stackexchange.com/questions/216511/comparing-the-first-column-of-two-files-and-printing-the-entire-row-of-the-secon/216515#216515?newreg=f4fd3a8743aa4210863f2ef527d0838b
Upvotes: 0
Views: 295
Reputation: 735
I'm going to assume that it's the problem that matters, not the implementation language, so here's an alternative using perl:
#! /usr/bin/perl
use strict;

my %elements=();
my $filecount=@ARGV;     # number of input files on the command line

while (<>) {
    # top-level key: the line itself; second-level key: the file it came from
    $elements{$_}->{$ARGV}++;
}

print grep {!/^$/} map {
    "$_" if (keys %{ $elements{$_} } == $filecount)
} (keys %elements);
The while loop builds a hash-of-hashes (aka "HoH"; see man perldsc and man perllol for details, and see below for an example), with the top-level key being each line from each input file, and the second-level key being the names of the file(s) that value appeared in.
The grep ... map {...} returns each top-level key where the number of files it appears in is equal to the number of input files.
Here's what the data structure looks like, using the example you gave to ilkkachu:
{
'A' => { 'file1' => 1 },
'B' => { 'file2' => 1 },
'C' => { 'file1' => 1, 'file2' => 1, 'file3' => 1 },
'E' => { 'file2' => 1 },
'F' => { 'file1' => 1 },
'K' => { 'file3' => 1 },
'L' => { 'file3' => 1 }
}
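With those three example files on disk, a run of the script would look something like this (a hypothetical session; the script name common.pl is an assumption, not something from the question):

$ perl common.pl file1 file2 file3
C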
Note that if there happen to be any duplicates in a single file, that fact is stored in this structure and can be checked.
The grep before the map isn't strictly required in this particular example, but is useful if you want to store the result in an array for further processing rather than print it immediately. With the grep, it returns an array of only the matching elements, or in this case just the single value C. Without it, it returns an array of empty strings plus the matching elements, e.g. ("", "", "", "", "C", "", ""). Actually, they return the elements with a newline (\n) at the end because I didn't use chomp in the while loop, as I knew I'd be printing them directly. In most programs, I'd use chomp to strip newlines and/or carriage returns.
Upvotes: 0
Reputation: 6527
If you can have the same value multiple times in a single file, we'll need to take care to only count it once for each file.
A couple of variations with GNU awk (which is needed for ARGIND to be available; it could be emulated by checking FILENAME, but that's even uglier):
gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
      END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }' file1 file2 file3
The array A is keyed by the values (lines), and holds a bitmap of the files in which a line has been found. For each line read, we set bit number ARGIND-1 (since ARGIND starts at one).
At the end of input, run through all saved lines, and print them if the bitmap is all ones (up to the number of files seen).
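For instance, the first variant could be exercised like this (a sketch; the file contents are taken from the example in the perl answer above, not from the original question):

printf '%s\n' A C F > file1
printf '%s\n' B C E > file2
printf '%s\n' C K L > file3
gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
      END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }' file1 file2 file3

This prints C, the only line present in all three files.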
gawk 'ARGIND > LASTIND {
LASTIND = ARGIND; for (x in CURR) { ALL[x] += 1; delete CURR[x] }
}
{ CURR[$0] = 1 }
END { for (x in CURR) ALL[x] += 1;
for (x in ALL) if (ALL[x] == ARGIND) print x
}' file1 file2 file3
Here, when a line is encountered, the corresponding element in array CURR is set (middle part). When the file number changes (ARGIND > LASTIND), values in array ALL are increased for all values set in CURR, and the latter is cleared. At the END of input, the values in ALL are updated for the last file, and the total count is checked against the total number of files, printing the ones that appear in all files.
The bitmap approach is likely slightly faster with large inputs, since it doesn't involve creating and walking through a temporary array, but the number of files it can handle is limited by the number of bits the bit operations can handle (which seems to be about 50 on 64-bit Linux).
In both cases, the resulting printout will be in essentially a random order, since associative arrays do not preserve ordering.
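If a predictable order is needed, the output of either command can simply be piped through sort, for example:

gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
      END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }' file1 file2 file3 | sort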
Upvotes: 0
Reputation: 21965
to find what elements are contained across all files
awk is your friend, as you guessed. Use the procedure below.
#Store the files in an array. Assuming all files are in one place
filelist=( $(find . -maxdepth 1 -type f) ) #array of files

awk -v count="${#filelist[@]}" '{value[$1]++}END{for(i in value){
if(value[i]==count){printf "Value %s is found in all files\n",i}}}' "${filelist[@]}"
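As a rough illustration: in a directory containing only the three example files used in the other answers (file1 holding A, C, F; file2 holding B, C, E; file3 holding C, K, L - an assumption, since the original files aren't shown), running the snippet above would print:

Value C is found in all files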
Note -v count="${#filelist[@]}", which passes the total file count to awk. The # at the beginning of the array expansion (${#filelist[@]}) gives the number of elements.
value[$1]++ increments the count of a value as seen in the current file; it also creates value[$1] with an initial value of zero if it does not already exist.
The END block in awk is executed only at the very end, i.e. after every record from all the files has been processed.
Upvotes: 2