SumNeuron

Reputation: 5188

AWK: finding common elements across arbitrary number of columns (either single column files or column matrix)

Problem

I have several files, each one column, and I want to compare each of them to one another to find what elements are contained across all files. Alternatively - if it is easier - I could make a column matrix.

Question

How can I find the common elements across multiple columns?
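
For illustration, here is the kind of input involved (the file names and contents below are assumed, reconstructed from the three-file example referenced in the answers); the expected output is the single element C:

$ cat file1
A
C
F
$ cat file2
B
C
E
$ cat file3
C
K
L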

Request

I am not an expert at awk (obviously). So a verbose explanation of the code would be much appreciated.

Other

@joepvd made some code that was somewhat similar... https://unix.stackexchange.com/questions/216511/comparing-the-first-column-of-two-files-and-printing-the-entire-row-of-the-secon/216515#216515?newreg=f4fd3a8743aa4210863f2ef527d0838b

Upvotes: 0

Views: 295

Answers (3)

cas

Reputation: 735

I'm going to assume that it's the problem that matters, not the implementation language, so here's an alternative using Perl:

#! /usr/bin/perl

use strict;

my %elements=();    
my $filecount=@ARGV;

while(<>) {
  $elements{$_}->{$ARGV}++;
};

print grep {!/^$/} map {
    "$_" if (keys %{ $elements{$_} } == $filecount)
} (keys %elements);

The while loop builds a hash-of-hashes (aka "HoH"; see man perldsc and man perllol for details, and see below for an example), with the top-level key being each line from each input file, and the second-level key being the name(s) of the file(s) that value appeared in.

The grep ... map {...} returns each top-level key where the number of files it appears in is equal to the number of input files.

Here's what the data structure looks like, using the example you gave to ilkkachu:

{
  'A' => { 'file1' => 1 },
  'B' => { 'file2' => 1 },
  'C' => { 'file1' => 1, 'file2' => 1, 'file3' => 1 },
  'E' => { 'file2' => 1 },
  'F' => { 'file1' => 1 },
  'K' => { 'file3' => 1 },
  'L' => { 'file3' => 1 }
}

Note that if there happen to be any duplicates in a single file, that fact is stored in this structure and can be checked.

The grep before the map isn't strictly required in this particular example, but is useful if you want to store the result in an array for further processing rather than print it immediately.

With the grep, it returns an array of only the matching elements, or in this case just the single value C. Without it, it returns an array of empty strings plus the matching elements, e.g. ("", "", "", "", "C", "", ""). Actually, they return the elements with a newline (\n) at the end because I didn't use chomp in the while loop, as I knew I'd be printing them directly. In most programs, I'd use chomp to strip newlines and/or carriage returns.
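
A possible invocation (a sketch; common.pl is just an assumed name for a file holding the script above), run against the example files:

$ perl common.pl file1 file2 file3
C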

Upvotes: 0

ilkkachu

Reputation: 6527

If you can have the same value multiple times in a single file, we'll need to take care to only count it once for each file.

A couple of variations with GNU awk (which is needed for ARGIND to be available; it could be emulated by checking FILENAME, but that's even uglier).

gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
      END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }' file1 file2 file3

The array A is keyed by the values (lines), and holds a bitmap of the files in which a line has been found. For each line read, we set bit number ARGIND-1 (since ARGIND starts with one).

At the end of input, run through all saved lines, and print them if the bitmap is all ones (up to the number of files seen).
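
As a quick sanity check of the bit arithmetic (a sketch using gawk's or() and lshift() built-ins): with three files, a line present in all of them accumulates the bitmap 1|2|4 = 7, which is exactly lshift(1, 3) - 1:

gawk 'BEGIN { bitmap = or(or(lshift(1, 0), lshift(1, 1)), lshift(1, 2))  # bits for files 1, 2, 3
              print bitmap, lshift(1, 3) - 1 }'                          # prints: 7 7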

gawk 'ARGIND > LASTIND { 
            LASTIND = ARGIND; for (x in CURR) { ALL[x] += 1; delete CURR[x] } 
      } 
      { CURR[$0] = 1 }
      END { for (x in CURR) ALL[x] += 1; 
            for (x in ALL) if (ALL[x] == ARGIND) print x 
      }' file1 file2 file3

Here, when a line is encountered, the corresponding element in array CURR is set (middle part). When the file number changes (ARGIND > LASTIND), values in array ALL are increased for all values set in CURR, and the latter is cleared. At the END of input, the values in ALL are updated for the last file, and the total count is checked against the total number of files, printing the ones that appear in all files.
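
To make the flow concrete, here is roughly how ALL evolves on the example files (a sketch of the intermediate state, not actual program output):

# after file1 (A, C, F) is flushed into ALL:  ALL = { A:1, C:1, F:1 }
# after file2 (B, C, E) is flushed into ALL:  ALL = { A:1, B:1, C:2, E:1, F:1 }
# the END block flushes file3 (C, K, L):      ALL = { A:1, B:1, C:3, E:1, F:1, K:1, L:1 }
# only C has ALL[x] == ARGIND (3), so only C is printed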


The bitmap approach is likely slightly faster with large inputs, since it doesn't involve creating and walking through a temporary array, but the number of files it can handle is limited by the number of bits the bit operations can handle (which seems to be about 50 on 64-bit Linux).

In both cases, the resulting printout will be in essentially a random order, since associative arrays do not preserve ordering.
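
If a deterministic order matters, the output can simply be piped through sort (a trivial addition, not part of the commands above), e.g. for the bitmap variant:

gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
      END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }' file1 file2 file3 | sort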

Upvotes: 0

sjsam

Reputation: 21965

to find what elements are contained across all files

awk is your friend as you guessed. Use the procedure below:

# Store the files in an array, assuming all files are in one place
filelist=( $(find . -maxdepth 1 -type f) ) # array of files
awk -v count="${#filelist[@]}" '{value[$1]++}END{for(i in value){
if(value[i]==count){printf "Value %s is found in all files\n",i}}}' "${filelist[@]}"

Note

  1. We used -v count="${#filelist[@]}" to pass the total file count to awk. Note that # at the beginning of an array expansion gives the element count.
  2. value[$1]++ increments the count of a value as it is seen in a file. It also creates value[$1] with an initial value of zero if it does not already exist.
  3. This method fails if a value appears in a file more than once (see the duplicate-safe sketch after this list).
  4. The END block in awk is executed only at the very end, i.e. after every record from all the files has been processed.
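
Regarding note 3, a duplicate-safe variant (a sketch, not part of the original procedure) could count a value only the first time it appears in each file, by keying an auxiliary array on FILENAME and the value:

awk -v count="${#filelist[@]}" '
  !seen[FILENAME,$1]++ { value[$1]++ }   # count a value only once per file
  END{for(i in value){if(value[i]==count){printf "Value %s is found in all files\n",i}}}
' "${filelist[@]}"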

Upvotes: 2
