chas
chas

Reputation: 1645

find common elements in >2 files

I have three files as shown below

file1.txt

"aba" 0 0 
"aba" 0 0 1
"abc" 0 1
"abd" 1 1 
"xxx" 0 0

file2.txt

"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1

file3.txt

"xyx" 0 0
"aba" 0 0 
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1

I want to find the similar elements in all the three files based on first two columns. To find similar elements in two files i have used something like

awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt 

But, how can we find similar elements in all the files, when the input files are more than 2? Can anyone help?

With the current awk solution, the output ignores the duplicate key columns and gives the output as

"xxx" 0 0

If we assume the output comes from file1.txt, the expected output is:

"aba" 0 0 
"aba" 0 0 1
"xxx" 0 0 

i.e it should get the rows with duplicate key columns as well.

Upvotes: 7

Views: 4047

Answers (3)

Birei
Birei

Reputation: 36252

Try following solution generalized for N files. It saves data of first file in a hash with value of 1, and for each hit from next files that value is incremented. At the end I compare if the value of each key it's the same as the number of files processed and print only those that match.

awk '
    FNR == NR { arr[$1,$2] = 1; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2] 
        } 
    }
' file{1..3}

It yields:

"xxx" 0
"aba" 0

EDIT to add a version that prints the whole line (see comments). I've added another array with same key where I save the line, and also use it in the printf function. I've left old code commented.

awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2] 
            printf "%s\n", line[ key ] 
        } 
    }
' file{1..3}

NEW EDIT (see comments) to add a version that handles multiple lines with same key. Basically I join all entries instead saving only one, changing line[$1,$2] = $0 with line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. At the time of printing I do the reverse splitting with the separator (SUBSEP variable) and print each entry.

awk '
    FNR == NR { 
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    FNR == 1 { delete found }
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END { 
        num_files = ARGC -1 
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= length( line_arr ); i++ ) { 
                printf "%s\n", line_arr[ i ]
            } 
        } 
    }
' file{1..3}

With new data edited in question, it yields:

"xxx" 0 0
"aba" 0 0 
"aba" 0 0 1

Upvotes: 3

Steve
Steve

Reputation: 54392

For three files, all you need is:

awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file2.txt file3.txt

The FNR==NR block returns true for only the first file in the arguments list. The next statement in this block forces a skip over the remained of the code. Therefore, ($1,$2) in a is performed for all files in the arguments list excluding the first one. To process more files in the way you have, all you need to do is list them.


If you need more powerful globbing on the command line, use extglob. You can turn it on with shopt -s extglob, and turn it off with shopt -u extglob. For example:

awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt !(file1.txt)

If you have hard to find files, use find. For example:

awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt $(find /path/to/files -type f -name "*[23].txt")

I assume you're looking for a glob range for 'N' files. For example:

awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file{2,3}.txt

Upvotes: 1

Sidharth C. Nadhan
Sidharth C. Nadhan

Reputation: 2243

This python script will list out the common lines among all files :

import sys
i,l = 0,[]
for files in sys.argv[1:]:
  l.append(set())
  for line in open(files): l[i].add(" ".join(line.split()[0:2]))
  i+=1
commonFields =  reduce(lambda s1, s2: s1 & s2, l)
for files in sys.argv[1:]:
  print "Common lines in ",files
  for line in open(files):
    for fields in commonFields:
      if fields in line:
        print line,
        break

Usage : python script.py file1 file2 file3 ...

Upvotes: 1

Related Questions