Reputation: 1645
I have three files as shown below
file1.txt
"aba" 0 0
"aba" 0 0 1
"abc" 0 1
"abd" 1 1
"xxx" 0 0
file2.txt
"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1
file3.txt
"xyx" 0 0
"aba" 0 0
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1
I want to find the rows that are common to all three files, matching on the first two columns. To find the common rows in two files I have used something like
awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt
But how can we find the common rows across all the files when there are more than two input files? Can anyone help?
With the current awk solution, rows with duplicate key columns are ignored and the output is:
"xxx" 0 0
If we assume the output comes from file1.txt, the expected output is:
"aba" 0 0
"aba" 0 0 1
"xxx" 0 0
i.e., it should also return the rows with duplicate key columns.
Upvotes: 7
Views: 4047
Reputation: 36252
Try the following solution, generalized for N files. It saves the keys of the first file in a hash with a value of 1, and for each hit in the following files that value is incremented. At the end I check whether the value of each key equals the number of files processed and print only those that match. Note that the counter goes up once per matching line, so this version assumes a key appears at most once in each file.
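awk '
    # First file: remember each key (first two columns).
    FNR == NR { arr[$1,$2] = 1; next }
    # Later files: count each line whose key was seen in the first file.
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END {
        for ( key in arr ) {
            # Keep only keys whose count equals the number of files.
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2]
        }
    }
' file{1..3}.txt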
With the question's original data, it yields:
"xxx" 0
"aba" 0
EDIT to add a version that prints the whole line (see comments). I've added another array with the same key where I save the line, and I use it in the printf call. I've left the old code commented out.
awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END {
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2]
            printf "%s\n", line[ key ]
        }
    }
' file{1..3}.txt
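The counting scheme used in both versions above can also be sketched outside awk. The following is a minimal Python illustration, not the answer's code: common_keys is a hypothetical helper name, and each file is assumed to be passed in as a list of lines so that the example is self-contained. It deliberately departs from the awk code in one respect: each key is counted at most once per file, so repeated keys inside one file cannot push the count past the number of files.

```python
from collections import Counter

def common_keys(files):
    """Return the (col1, col2) keys that occur in every file.

    `files` is a list of files, each given as a list of text lines
    (an assumption made to keep the sketch self-contained).
    """
    hits = Counter()
    for lines in files:
        # A set ensures each key is counted at most once per file.
        keys = {tuple(line.split()[:2]) for line in lines if line.strip()}
        hits.update(keys)
    # Keep only keys whose count equals the number of files.
    return {key for key, n in hits.items() if n == len(files)}

# Sample data copied from the question.
file1 = ['"aba" 0 0', '"aba" 0 0 1', '"abc" 0 1', '"abd" 1 1', '"xxx" 0 0']
file2 = ['"xyz" 0 0', '"aba" 0 0 0 0', '"aba" 0 0 0 1', '"xxx" 0 0', '"abc" 1 1']
file3 = ['"xyx" 0 0', '"aba" 0 0', '"aba" 0 1 0', '"xxx" 0 0 0 1', '"abc" 1 1']

for key in sorted(common_keys([file1, file2, file3])):
    print(*key)
# "aba" 0
# "xxx" 0
```

Sorting the result gives a stable order, whereas awk's for ( key in arr ) loop visits keys in an unspecified order.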
NEW EDIT (see comments) to add a version that handles multiple lines with the same key. Basically I join all entries instead of saving only one, changing line[$1,$2] = $0 to line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. At print time I do the reverse, splitting on the separator (the SUBSEP variable), and print each entry. The found array, reset at the first line of each file, makes sure a key is counted at most once per file.
awk '
    FNR == NR {
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    FNR == 1 { delete found }
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END {
        num_files = ARGC - 1
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            # split() returns the number of pieces, which is portable across awks.
            n = split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= n; i++ ) {
                printf "%s\n", line_arr[ i ]
            }
        }
    }
' file{1..3}.txt
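With the new data edited into the question, it yields (the order of the keys may vary, because awk does not define the traversal order of for-in loops):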
"xxx" 0 0
"aba" 0 0
"aba" 0 0 1
Upvotes: 3
Reputation: 54392
For three files, all you need is:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file2.txt file3.txt
The FNR==NR condition is true only for the first file in the argument list. The next statement in this block skips the remainder of the code. Therefore, ($1,$2) in a is evaluated for every file in the argument list except the first one. To process more files this way, all you need to do is list them. Note that this prints every line of the later files whose key occurs in file1.txt; it does not check that the key occurs in all of the files.
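To make concrete which lines the one-liner prints, here is a rough Python equivalent. It is only a sketch: lines_keyed_in_first is a hypothetical name, and the inputs are assumed to be lists of lines rather than files on disk.

```python
def lines_keyed_in_first(first, others):
    """Yield lines from `others` whose (col1, col2) key occurs in `first`."""
    # Mirrors the FNR==NR block: collect the keys of the first input.
    seen = {tuple(line.split()[:2]) for line in first}
    # Mirrors the `($1,$2) in a` pattern applied to every later input.
    for lines in others:
        for line in lines:
            if tuple(line.split()[:2]) in seen:
                yield line

# Made-up inputs, chosen so that no key occurs in all three.
first = ["a 1 x", "b 2 x"]
second = ["a 1 y", "c 3 y"]
third = ["b 2 z"]

print(list(lines_keyed_in_first(first, [second, third])))
# ['a 1 y', 'b 2 z']
```

Both lines are printed even though neither key occurs in all three inputs, so a strict intersection needs an additional per-file check.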
If you need more powerful globbing on the command line, use extglob. In bash you can turn it on with shopt -s extglob and turn it off with shopt -u extglob. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt !(file1.txt)
If your files are hard to find, use find. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt $(find /path/to/files -type f -name "*[23].txt")
I assume you're looking for a glob range for 'N' files. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file{2,3}.txt
Upvotes: 1
Reputation: 2243
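This Python 3 script lists the lines whose first two columns are common to all files:
import sys
from functools import reduce

# Collect the set of (col1, col2) keys found in each file.
key_sets = []
for path in sys.argv[1:]:
    with open(path) as fh:
        key_sets.append({tuple(line.split()[:2]) for line in fh})

# Keys present in every file.
common = reduce(lambda s1, s2: s1 & s2, key_sets)

# Print each line whose key is common to all files.
for path in sys.argv[1:]:
    print("Common lines in", path)
    with open(path) as fh:
        for line in fh:
            # Compare the parsed key, not a substring of the line.
            if tuple(line.split()[:2]) in common:
                print(line, end="")
Usage: python3 script.py file1 file2 file3 ...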
Upvotes: 1