Reputation: 1476
I have 15 files like
file1.csv
a,cg2,0,0,0,21,0
a,cq1,10,0,0,0,0
a,cm2,0,19,0,0,0
...
a,ad10,0,0,0,37,0
file2.csv
d,cm1,0,3,0,0,0
d,cs2,0,32,0,0,0
d,cg2,0,0,9,0,0
...
d,az2,0,0,0,21,0
. . . .
file15.csv
s,sd1,0,23,0,0,0
s,cw1,0,0,7,0,0
s,c23,0,0,90,0,0
...
s,cg2,0,45,0,0,0
Each file has a different number of lines. I want to compare the second field across all 15 files and extract the values that are common to the second field of all 15 files.
In the above case the output is:
cg2
(assuming it is common to the second field of all 15 files)
I am a little new to Unix and shell scripting; please help.
Upvotes: 1
Views: 2975
Reputation: 360095
Do you want the full lines from each of the fifteen files where field 2 appears in all fifteen files? Or do you only want a list of the field 2 values that appear in all fifteen files?
The former:
a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
. . .
s,cg2,0,45,0,0,0
. . .
The latter:
cg2
. . .
If the latter, then this should work:
awk -F, '{arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] >= c) {print i}}}' file*.csv
Broken out on multiple lines:
awk -F, '{
    arr[$2]++;                      # count occurrences of each field-2 value
    if (FILENAME != prevfile) {     # c ends up holding the number of files
        c++;
        prevfile = FILENAME
    }
}
END {
    for (i in arr) {
        if (arr[i] >= c) {          # value was seen in every file
            print i
        }
    }
}' file*.csv
Explanation: >= is used instead of == so that the count check still works if a value appears more than once in a single file.
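For example, with the sample files from the question, running it should print just the single common value (the output order of for (i in arr) is not guaranteed):
$ awk -F, '{arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] >= c) {print i}}}' file*.csv
cg2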
Edit:
Here's a way to print the full matching lines using a two-pass technique. It's a modification of the version above. Make sure to list the files twice.
awk -F, '
FILENAME == first && flag {    # back at the first file: pass 1 is done,
    exit                       # so jump to the END block for pass 2
}
! first {
    first = FILENAME           # remember the name of the first file
}
FILENAME != first {
    flag = 1
}
{
    arr[$2]++;                 # pass 1: count field-2 values
    if (FILENAME != prevfile) {
        c++;                   # c ends up holding the number of files
        prevfile = FILENAME
    }
}
END {
    # pass 2: print the matching lines
    do {
        if (arr[$2] >= c) {
            print;
        }
    } while (getline);
    # print the list of words
    for (i in arr) {
        if (arr[i] >= c) {
            print i
        }
    }
}' file*.csv file*.csv
It depends on the first file in the first group having the same name as the first file in the second group; using globbing similar to what I've shown takes care of that requirement.
It prints the matching lines (not grouped, though), then it prints the list of words. If you want only one or the other, comment out or remove the loop that you don't want (the do/while or the for).
If you print only the full lines, you can pipe the output to sort -t , -k2,2 to have them grouped.
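For example, keeping only the do/while loop and using the sample data from the question, something like this should group the matches by the second field:
$ awk -F, ' ... ' file*.csv file*.csv | sort -t , -k2,2
a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
...
s,cg2,0,45,0,0,0
(where ' ... ' stands for the two-pass script above with the for loop removed)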
Piping only the list of words to sort will put them in the same order for easier comparison.
Upvotes: 3
Reputation: 88378
Fun problem.
One way to do it, entirely in Bash, is as follows.
One thing you will need to invoke is join -t ',' -1 2 -2 2 file1 file2 to join on the second column of two files. Before you can join, though, you must sort each file on its second column.
Do successive joins in a for-loop, because join takes only two files as arguments (a sketch of such a loop follows the addendum below).
ADDENDUM
Here is a little transcript showing successive joins. You can adapt it fairly easily, I think.
$ cat 1.csv
a,b,c,d
e,f,g,h
i,j,k,l
$ cat 2.csv
7,5,4,3
3,b,s,e
2,f,5,5
$ cat 3.csv
4,5,6,7
0,0,0,0
1,b,4,4
$ join -t ',' -1 2 -2 2 1.csv 2.csv | cut -f 1 -d ',' > temp
$ cat temp
b
f
$ join -t ',' -2 2 temp 3.csv | cut -f 1 -d ','
b
The first join (on the first two files) produces the joined value in the first column of the result. So as you join file3, file4, file5, etc., you will be joining on the first column of the result you are generating, which is why you only need the -2 option. To keep things very efficient, always cut out all but the first column each time you do the join.
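Putting that together, here is a rough, untested sketch of the loop in Bash; the file*.csv names, the .sorted suffix, and the result/temp scratch files are placeholders to adapt:
#!/bin/bash
# Sketch of the successive-join idea above.
files=(file*.csv)

# join requires its inputs sorted on the join field,
# so pre-sort each file on the second column.
for f in "${files[@]}"; do
    sort -t ',' -k 2,2 "$f" > "$f.sorted"
done

# First join: field 2 of the first two files; keep only the join field.
join -t ',' -1 2 -2 2 "${files[0]}.sorted" "${files[1]}.sorted" |
    cut -d ',' -f 1 | sort -u > result

# Successive joins: the running result is a single sorted column,
# so only -2 2 is needed from here on.
for f in "${files[@]:2}"; do
    join -t ',' -2 2 result "$f.sorted" | cut -d ',' -f 1 | sort -u > temp
    mv temp result
done

cat result    # with the sample data, this should end up printing: cg2
The sort -u keeps the running result sorted (which join requires) and drops duplicates in case a value appears more than once in a single file.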
Upvotes: 1