Saurabh Gupta
Saurabh Gupta

Reputation: 1

Remove Duplicates and take maximum values for other records while merging two files in Unix

I have two files a.txt and b.txt with 6 columns each. I want to merge both files and remove exact duplicates based on all 6 columns while for other cases I would like to have maximum of column 3,4,5,6 based on grouping of column 1 and 2. I would like to explain this by an example below

File a.txt

Col1 Col2 Col3 Col4 Col5 Col6

A1 B1 1 1 1 1

A2 B2 1 0 0 1

File b.txt

Col1 Col2 Col3 Col4 Col5 Col6

A1 B1 1 1 1 1

A2 B2 0 1 0 1

Result should be :

Col1 Col2 Col3 Col4 Col5 Col6

A1 B1 1 1 1 1

A2 B2 1 1 0 1

So grouping is on Col1 and Col2 and if other 4 columns have same value in both files they are exact duplicates and duplicate should be removed but if they have different values in both the files then max value or 1 should be selected

Upvotes: 0

Views: 156

Answers (2)

karakfa
karakfa

Reputation: 67467

a two step approach will be easier.

First define a function to create key for join

$ f() { awk '{print $1"_"$2 "\t" $0}' "$1" | sort; }

join the files with the key, and let awk pick the max of each field

$ join <(f file1) <(f file2) | 
  awk 'function max(x,y) {return x>y?x:y}
      {n=(NF+1)/2; 
       printf "%s %s",$2,$3; 
       for(i=4;i<=n;i++) printf " %s",max($i,$(i+n-1)); 
       print ""}'

A1 B1 1 1 1 1
A2 B2 1 1 0 1

Upvotes: 0

Shawn
Shawn

Reputation: 52344

Using a mix of sort -u to get rid of the duplicates and GNU datamash to find the max values of each group's columns makes it trivial:

$ sort -u a.txt b.txt | datamash -W -g1,2 max 3-6
A1  B1  1   1   1   1
A2  B2  1   1   0   1

Upvotes: 1

Related Questions