Reputation: 1
I have two files, a.txt and b.txt, each with 6 columns. I want to merge both files and remove exact duplicates (rows identical in all 6 columns); for the remaining rows, I would like the maximum of columns 3, 4, 5 and 6 within each group defined by columns 1 and 2. Let me explain with the example below.
File a.txt
Col1 Col2 Col3 Col4 Col5 Col6
A1 B1 1 1 1 1
A2 B2 1 0 0 1
File b.txt
Col1 Col2 Col3 Col4 Col5 Col6
A1 B1 1 1 1 1
A2 B2 0 1 0 1
The result should be:
Col1 Col2 Col3 Col4 Col5 Col6
A1 B1 1 1 1 1
A2 B2 1 1 0 1
So the grouping is on Col1 and Col2: if the other 4 columns have the same values in both files, the rows are exact duplicates and only one copy should be kept; if the values differ between the files, the maximum value (here, 1) should be selected for each column.
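For reference, the requested merge-and-max logic can be restated as a runnable sketch in plain awk (a hypothetical sketch, not from either answer; it recreates the example a.txt and b.txt inline and assumes whitespace-separated columns):

```shell
# Recreate the question's example files (data copied from the question).
printf 'A1 B1 1 1 1 1\nA2 B2 1 0 0 1\n' > a.txt
printf 'A1 B1 1 1 1 1\nA2 B2 0 1 0 1\n' > b.txt

# Group by $1,$2 across both files and keep a running max of $3..$6 per group.
# Exact duplicates fall out automatically: equal values never raise the max.
result=$(awk '{
  k = $1 OFS $2                                  # group key: Col1 Col2
  for (i = 3; i <= 6; i++)
    if (!(k in seen) || $i > m[k, i]) m[k, i] = $i
  seen[k] = 1
}
END {
  for (k in seen) print k, m[k, 3], m[k, 4], m[k, 5], m[k, 6]
}' a.txt b.txt | sort)
printf '%s\n' "$result"
```

The `sort` at the end only fixes the ordering, since awk iterates its arrays in an unspecified order.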
Upvotes: 0
Views: 156
Reputation: 67467
A two-step approach will be easier.
First, define a function that creates a key for the join:
$ f() { awk '{print $1"_"$2 "\t" $0}' "$1" | sort; }
Then join the files on that key and let awk pick the max of each field:
$ join <(f file1) <(f file2) |
awk 'function max(x,y) {return x>y?x:y}
{n=(NF+1)/2;                  # 13 fields: key + 6 per file, so n=7
printf "%s %s",$2,$3;         # Col1 Col2 (taken from the first file)
for(i=4;i<=n;i++) printf " %s",max($i,$(i+n-1));  # $i pairs with $(i+6)
print ""}'
A1 B1 1 1 1 1
A2 B2 1 1 0 1
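To see why the awk script compares `$i` against `$(i+n-1)`, it helps to look at one joined record: field 1 is the key, fields 2-7 are the six columns from the first file, and fields 8-13 are the six columns from the second, so each column's counterpart sits 6 fields later. A small demonstration (using the question's a.txt/b.txt names and temporary files instead of bash-only process substitution):

```shell
# Recreate the question's example files (data copied from the question).
printf 'A1 B1 1 1 1 1\nA2 B2 1 0 0 1\n' > a.txt
printf 'A1 B1 1 1 1 1\nA2 B2 0 1 0 1\n' > b.txt

# The key-building function from the answer above.
f() { awk '{print $1"_"$2 "\t" $0}' "$1" | sort; }

f a.txt > a.keyed
f b.txt > b.keyed
joined=$(join a.keyed b.keyed)
printf '%s\n' "$joined"
# Each output line has 13 fields: key, then Col1..Col6 from each file.
```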
Upvotes: 0
Reputation: 52344
Using a mix of sort -u to get rid of the duplicates and GNU datamash to find the max values of each group's columns makes it trivial:
$ sort -u a.txt b.txt | datamash -W -g1,2 max 3-6
A1 B1 1 1 1 1
A2 B2 1 1 0 1
Upvotes: 1