Tanmaya Meher

Reputation: 1476

finding common rows in files based on one column

I have 15 files like

file1.csv

a,cg2,0,0,0,21,0
a,cq1,10,0,0,0,0
a,cm2,0,19,0,0,0
...
a,ad10,0,0,0,37,0

file2.csv

d,cm1,0,3,0,0,0
d,cs2,0,32,0,0,0
d,cg2,0,0,9,0,0
...
d,az2,0,0,0,21,0

. . . .

file15.csv

s,sd1,0,23,0,0,0
s,cw1,0,0,7,0,0
s,c23,0,0,90,0,0
...
s,cg2,0,45,0,0,0

The files have different numbers of lines. I want to compare the second field of all 15 files and extract the values that are common to the second field of all 15 files.

In the above case, the output is:

cg2

(assuming cg2 is common to the second field of all 15 files)

I am a little new to Unix and shell scripting; any help would be appreciated.

Upvotes: 1

Views: 2975

Answers (2)

Dennis Williamson

Reputation: 360095

Do you want the full lines from each of the fifteen files where field 2 appears in all fifteen files? Or do you only want a list of the field 2 values that appear in all fifteen files?

The former:

a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
. . .
s,cg2,0,45,0,0,0
. . .

The latter:

cg2
. . .

If the latter, then this should work:

awk -F, '{if (!seen[FILENAME, $2]++) arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] == c) {print i}}}' file*.csv

Broken out on multiple lines:

awk -F, '{
             if (!seen[FILENAME, $2]++) arr[$2]++
             if (FILENAME != prevfile) {
                 c++
                 prevfile = FILENAME
             }
         }
         END {
             for (i in arr) {
                 if (arr[i] == c) {
                     print i
                 }
             }
         }' file*.csv

Explanation:

  • increment the count of files in which a field 2 value occurs (the seen array ensures each value is counted at most once per file)
  • if the filename changes, increment the count of files (for the first file, prevfile changes from a null string to the filename and the count increments from 0 to 1)
  • save the current filename
  • once all the counting is done, iterate over the array by its keys
  • if the count contained in the array equals the number of files, then the field 2 value appeared in all the files (counting each value at most once per file means a value that occurs several times within a single file cannot be overcounted)
  • so print the key (which is a field 2 value)
  • a glob is used to get all the files, but you could list them explicitly
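The steps above can be sanity-checked end to end on a few tiny sample files (the file names and rows here are invented for the demo, standing in for the real fifteen files):

```shell
# Quick check of the counting approach on three small sample files.
dir=$(mktemp -d)
printf 'a,cg2,0,0,0,21,0\na,cq1,10,0,0,0,0\n'  > "$dir/file1.csv"
printf 'd,cm1,0,3,0,0,0\nd,cg2,0,0,9,0,0\n'    > "$dir/file2.csv"
printf 's,sd1,0,23,0,0,0\ns,cg2,0,45,0,0,0\n'  > "$dir/file3.csv"
common=$(awk -F, '
    !seen[FILENAME, $2]++ { arr[$2]++ }            # count each value once per file
    FILENAME != prevfile  { c++; prevfile = FILENAME }
    END { for (i in arr) if (arr[i] == c) print i }
' "$dir"/file*.csv)
echo "$common"       # -> cg2
rm -r "$dir"
```

Only cg2 occurs in all three sample files, so it is the only value printed.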

Edit:

Here's a way to print the full matching lines using a two-pass technique. It's a modification of the version above. Make sure to list the files twice.

awk -F, '
         FILENAME == first && flag {
             exit
         }
         ! first {
             first = FILENAME
         }
         FILENAME != first {
             flag = 1
         }
         {
             if (!seen[FILENAME, $2]++) arr[$2]++
             if (FILENAME != prevfile) {
                 c++
                 prevfile = FILENAME
             }
         }
         END {
             # print the matching lines
             do {
                 if (arr[$2] == c) {
                     print
                 }
             } while (getline)
             # print the list of words
             for (i in arr) {
                 if (arr[i] == c) {
                     print i
                 }
             }
         }' file*.csv file*.csv

This depends on the first file in the first group having the same name as the first file in the second group. Using a glob as shown above takes care of that requirement.

It prints the matching lines (not grouped, though), then it prints the list of words. If you want only one or the other, comment out or remove the loop that you don't want (do/while or for).

If you print only the full lines, you can pipe the output to:

sort -t , -k2,2

to have them grouped.
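For example, with a few hard-coded lines standing in for the awk output (sample rows invented for the demo):

```shell
# Group CSV lines by their second field, regardless of source file.
grouped=$(printf 'a,cg2,0,0,0,21,0\nd,ab1,0,3,0,0,0\nd,cg2,0,0,9,0,0\n' |
    sort -t , -k2,2)
echo "$grouped"
```

The two cg2 lines end up adjacent even though they came from different files.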

Piping only the list of words to:

sort

will put them in the same order for easier comparison.

Upvotes: 3

Ray Toal

Reputation: 88378

Fun problem.

One way to do it, entirely in Bash, is as follows.

One thing you will need to invoke is join -t ',' -1 2 -2 2 file1 file2 to join on the second column of two files. Before you can join, though, you must sort each file on its second column.

Do successive joins in a for-loop, because join takes only two files as arguments.

ADDENDUM

Here is a little transcript showing successive joins. You can adapt it fairly easily, I think.

$ cat 1.csv
a,b,c,d
e,f,g,h
i,j,k,l
$ cat 2.csv
7,5,4,3
3,b,s,e
2,f,5,5
$ cat 3.csv
4,5,6,7
0,0,0,0
1,b,4,4
$ join -t ',' -1 2 -2 2 1.csv 2.csv | cut -f 1 -d ',' > temp
$ cat temp
b
f
$ join -t ',' -2 2 temp 3.csv | cut -f 1 -d ','
b

The first join (on the first two files) produces the joined value in the first column of the result. So as you join file3, file4, file5, and so on, you will be joining on the first column of the result you are generating, which is why you only need the -2 option. To keep things efficient, cut out all but the first column each time you do a join.
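Putting the loop together might look like the sketch below. It is simplified slightly: since only the field-2 values are needed, each file is reduced to that column up front, so the joins are on whole (single-column) lines and the repeated cut is folded into the loop. The file names and temp-file handling are assumptions for the demo.

```shell
# Sketch: intersect the second columns of many CSV files with join,
# demonstrated on three tiny sample files (names and rows invented).
dir=$(mktemp -d)
printf 'a,cg2,0,0,0,21,0\na,cq1,10,0,0,0,0\n'  > "$dir/file1.csv"
printf 'd,cm1,0,3,0,0,0\nd,cg2,0,0,9,0,0\n'    > "$dir/file2.csv"
printf 's,sd1,0,23,0,0,0\ns,cg2,0,45,0,0,0\n'  > "$dir/file3.csv"

# Seed the running result with file1's field-2 values; join needs sorted input.
cut -d , -f 2 "$dir/file1.csv" | sort -u > "$dir/common"
# Intersect with each remaining file in turn.
for f in "$dir"/file2.csv "$dir"/file3.csv; do   # ...through file15.csv
    cut -d , -f 2 "$f" | sort -u > "$dir/next"
    join "$dir/common" "$dir/next" > "$dir/joined"
    mv "$dir/joined" "$dir/common"
done
common=$(cat "$dir/common")
echo "$common"       # the field-2 values common to every file
rm -r "$dir"
```

Each pass shrinks the running result, so by the last file only the values present everywhere remain.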

Upvotes: 1
