ishan3243
ishan3243

Reputation: 1928

Merge two CSVs while resolving duplicates

I have a couple of csv's that i need to merge. I want to consider entries which have the same first and second columns as duplicates. I know the command for this is something like

sort -t"," -u -k 1,1 -k 2,2 file1 file2

I additionally want to resolve the duplicates in such a way that the entry from the second file is chosen everytime. What is the way to do that?

Upvotes: 1

Views: 97

Answers (2)

Jim Mischel
Jim Mischel

Reputation: 134125

If the suggestion to reverse the order of files to the sort command doesn't work (see other answer), another way to do this would be to concatenate the files, file2 first, and then sort them with the -s switch.

cat file2 file1 | sort -t"," -u -k 1,1 -k 2,2 -s

-s forces a stable sort, meaning that identical lines will appear in the same relative order. Since the input to sort has all of the lines from file2 before file1, all of the duplicates in the output should come from file2.

The sort man page doesn't explicitly state that input files will be read in the order that they're supplied on the command line, so I guess it's possible that an implementation could read the files in reverse order, or alternating lines, or whatever. But if you concatenate the files first then there's no ambiguity.

Upvotes: 1

luoluo
luoluo

Reputation: 5533

Change the order of the two files and add -s(@Jim Mischel give the hit) would solve your problem.

sort -t"," -u -k 1,1 -k 2,2 -s file2 file1      

man sort
   -u, --unique
          with -c, check for strict ordering; without -c, output only  the
          first of an equal run

   -s, --stable
          stabilize sort by disabling last-resort comparison

Short answer

awk -F"," '{out[$1$2]=$0} END {for(i in out) {print out[i]}}' file1 file2

A bit long answer:

awk 'BEGIN {
               FS=OFS=","; # set ',' as field separator
           }  

           {
             out[$1$2]=$0; # save the value to dict, new value would replace old value.
           } 

      END  { 
             for (i in out) {  # in the end, print all value of the dict
                 print out[i];
             }
           }' file1 file2

Upvotes: 1

Related Questions