Tanmaya Meher

Reputation: 1476

finding common rows in files based on one column

I have 15 files like

file1.csv

a,cg2,0,0,0,21,0
a,cq1,10,0,0,0,0
a,cm2,0,19,0,0,0
...
a,ad10,0,0,0,37,0

file2.csv

d,cm1,0,3,0,0,0
d,cs2,0,32,0,0,0
d,cg2,0,0,9,0,0
...
d,az2,0,0,0,21,0

. . . .

file15.csv

s,sd1,0,23,0,0,0
s,cw1,0,0,7,0,0
s,c23,0,0,90,0,0
...
s,cg2,0,45,0,0,0

The files have different numbers of lines. I want to compare the second field of all 15 files and extract the values that are common to the second field of all 15 files.

In the above case, the output is:

cg2

(assuming cg2 is common to the second field of all 15 files)

I am a little new to Unix and shell scripting; any help would be appreciated.

Upvotes: 1

Views: 2975

Answers (2)

Dennis Williamson

Reputation: 360095

Do you want the full lines from each of the fifteen files where field 2 appears in all fifteen files? Or do you only want a list of the field 2 values that appear in all fifteen files?

The former:

a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
. . .
s,cg2,0,45,0,0,0
. . .

The latter:

cg2
. . .

If the latter, then this should work:

awk -F, '{if (!seen[FILENAME, $2]++) arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] == c) {print i}}}' file*.csv

Broken out on multiple lines:

awk -F, '{
             if (!seen[FILENAME, $2]++) arr[$2]++
             if (FILENAME != prevfile) {
                 c++
                 prevfile = FILENAME
             }
         }
         END {
             for (i in arr) {
                 if (arr[i] == c) {
                     print i
                 }
             }
         }' file*.csv

Explanation:

  • increment the count of files in which a field 2 value occurs (the seen array ensures each value is counted at most once per file)
  • if the filename changes, increment the count of files (for the first file, prevfile changes from a null string to the filename and the count increments from 0 to 1)
  • save the current filename
  • once all the counting is done, iterate over the array by its keys
  • if the count contained in the array equals the number of files, then the field 2 value appeared in all the files (counting each value at most once per file means a value that occurs several times within a single file cannot be overcounted)
  • so print the key (which is a field 2 value)
  • a glob is used to get all the files, but you could list them explicitly
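The steps above can be sanity-checked end to end on a few tiny sample files (the file names and rows here are invented for the demo, standing in for the real fifteen files):

```shell
# Quick check of the counting approach on three small sample files.
dir=$(mktemp -d)
printf 'a,cg2,0,0,0,21,0\na,cq1,10,0,0,0,0\n'  > "$dir/file1.csv"
printf 'd,cm1,0,3,0,0,0\nd,cg2,0,0,9,0,0\n'    > "$dir/file2.csv"
printf 's,sd1,0,23,0,0,0\ns,cg2,0,45,0,0,0\n'  > "$dir/file3.csv"
common=$(awk -F, '
    !seen[FILENAME, $2]++ { arr[$2]++ }            # count each value once per file
    FILENAME != prevfile  { c++; prevfile = FILENAME }
    END { for (i in arr) if (arr[i] == c) print i }
' "$dir"/file*.csv)
echo "$common"       # -> cg2
rm -r "$dir"
```

Only cg2 occurs in all three sample files, so it is the only value printed.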

Edit:

Here's a way to print the full matching lines using a two-pass technique. It's a modification of the version above. Make sure to list the files twice.

awk -F, '
         FILENAME == first && flag {
             exit
         }
         ! first {
             first = FILENAME
         }
         FILENAME != first {
             flag = 1
         }
         {
             if (!seen[FILENAME, $2]++) arr[$2]++
             if (FILENAME != prevfile) {
                 c++
                 prevfile = FILENAME
             }
         }
         END {
             # print the matching lines
             do {
                 if (arr[$2] == c) {
                     print
                 }
             } while (getline)
             # print the list of words
             for (i in arr) {
                 if (arr[i] == c) {
                     print i
                 }
             }
         }' file*.csv file*.csv

This depends on the first file in the first group having the same name as the first file in the second group. Using a glob as shown above takes care of that requirement.

It prints the matching lines (not grouped, though), then it prints the list of words. If you want only one or the other, comment out or remove the loop that you don't want (do/while or for).

If you print only the full lines, you can pipe the output to:

sort -t , -k2,2

to have them grouped.
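For example, with a few hard-coded lines standing in for the awk output (sample rows invented for the demo):

```shell
# Group CSV lines by their second field, regardless of source file.
grouped=$(printf 'a,cg2,0,0,0,21,0\nd,ab1,0,3,0,0,0\nd,cg2,0,0,9,0,0\n' |
    sort -t , -k2,2)
echo "$grouped"
```

The two cg2 lines end up adjacent even though they came from different files.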

Piping only the list of words to:

sort

will put them in the same order for easier comparison.

Upvotes: 3

Ray Toal

Reputation: 88378

Fun problem.

One way to do it, entirely in Bash, is as follows.

One thing you will need to invoke is join -t ',' -1 2 -2 2 file1 file2 to join on the second column of two files. Before you can join, though, you must sort each file on its second column.

Do successive joins in a for-loop, because join takes only two files as arguments.

ADDENDUM

Here is a little transcript showing successive joins. You can adapt it fairly easily, I think.

$ cat 1.csv
a,b,c,d
e,f,g,h
i,j,k,l
$ cat 2.csv
7,5,4,3
3,b,s,e
2,f,5,5
$ cat 3.csv
4,5,6,7
0,0,0,0
1,b,4,4
$ join -t ',' -1 2 -2 2 1.csv 2.csv | cut -f 1 -d ',' > temp
$ cat temp
b
f
$ join -t ',' -2 2 temp 3.csv | cut -f 1 -d ','
b

The first join (on the first two files) produces the joined value in the first column of the result. So as you join file3, file4, file5, and so on, you will be joining on the first column of the result you are generating, which is why you only need the -2 option. To keep things efficient, cut out all but the first column each time you do a join.
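Putting the loop together might look like the sketch below. It is simplified slightly: since only the field-2 values are needed, each file is reduced to that column up front, so the joins are on whole (single-column) lines and the repeated cut is folded into the loop. The file names and temp-file handling are assumptions for the demo.

```shell
# Sketch: intersect the second columns of many CSV files with join,
# demonstrated on three tiny sample files (names and rows invented).
dir=$(mktemp -d)
printf 'a,cg2,0,0,0,21,0\na,cq1,10,0,0,0,0\n'  > "$dir/file1.csv"
printf 'd,cm1,0,3,0,0,0\nd,cg2,0,0,9,0,0\n'    > "$dir/file2.csv"
printf 's,sd1,0,23,0,0,0\ns,cg2,0,45,0,0,0\n'  > "$dir/file3.csv"

# Seed the running result with file1's field-2 values; join needs sorted input.
cut -d , -f 2 "$dir/file1.csv" | sort -u > "$dir/common"
# Intersect with each remaining file in turn.
for f in "$dir"/file2.csv "$dir"/file3.csv; do   # ...through file15.csv
    cut -d , -f 2 "$f" | sort -u > "$dir/next"
    join "$dir/common" "$dir/next" > "$dir/joined"
    mv "$dir/joined" "$dir/common"
done
common=$(cat "$dir/common")
echo "$common"       # the field-2 values common to every file
rm -r "$dir"
```

Each pass shrinks the running result, so by the last file only the values present everywhere remain.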

Upvotes: 1
