Reputation: 111
I have multiple csv files in a directory (e.g. file01.csv, file02.csv, file03.csv, file04.csv), but they are mixed: some contain headers and some do not.
Is there an efficient shell method to merge all the csv files together, with the first row containing the header?
Once merged, I intend to de-duplicate the file to remove duplicate rows.
Here is an example.
file01.csv (has a header):
name, dob, gender, date_of_enrolment
John, 11-06-01, M, 20-01-2020
Beth, 12-03-01, F, 20-01-2020
file02.csv (no header):
John, 11-06-01, M, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
file03.csv (has a header):
name, dob, gender, date_of_enrolment
Mary, 13-01-01, F, 20-02-2020
David, 13-02-02, M, 20-03-2020
Desired output, merged_file.csv:
name, dob, gender, date_of_enrolment
John, 11-06-01, M, 20-01-2020
Beth, 12-03-01, F, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
David, 13-02-02, M, 20-03-2020
Upvotes: 1
Views: 571
Reputation: 140417
If you make sure the first csv file read in does have a header, you can just do this:
shopt -s extglob
awk '!seen[$0]++' file01.csv file!(01).csv
Note that the above assumes you are using bash and that file01.csv has a header. The !seen[$0]++ condition prints each line only the first time it is seen, so duplicate rows and the repeated headers of later files are dropped, while listing file01.csv first keeps its header at the top. Also note that shopt -s extglob has to run before bash parses the file!(01).csv glob, which is why it is on its own line.
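Assuming the directory contains just the three sample files from the question, the run should look like this:
$ shopt -s extglob
$ awk '!seen[$0]++' file01.csv file!(01).csv
name, dob, gender, date_of_enrolment
John, 11-06-01, M, 20-01-2020
Beth, 12-03-01, F, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
David, 13-02-02, M, 20-03-2020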
If you don't want to worry about file order, you can do this instead. Note that it keeps two copies of each unique line in memory (one as a key in the seen array and one appended to s), which is the price of preserving input order while holding the header back until the end:
$ awk '
    !/name,/ && !seen[$0]++ {
        s = s sep $0
        sep = ORS
    }
    /name,/ { h = $0 }
    END { print h ORS s }
' file*.csv
name, dob, gender, date_of_enrolment
John, 11-06-01, M, 20-01-2020
Beth, 12-03-01, F, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
David, 13-02-02, M, 20-03-2020
Upvotes: 2
Reputation: 203985
$ awk '{print /[0-9]/,$0}' file* | sort -k1,1n -uk2 | cut -d' ' -f2-
name, dob, gender, date_of_enrolment
Beth, 12-03-01, F, 20-01-2020
David, 13-02-02, M, 20-03-2020
John, 11-06-01, M, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
The above prepends a column indicating the type of each line (0 for the header, 1 for everything else), sorts numerically on that column first to pull the header to the top, sorts unique on the original lines, and then removes the extra column added in the first step. Since only sort has to handle all of the input at once, and it is designed to use paging etc. to cope with large files, this should work even if your input files are massive, and it will be fast to execute.
Here are the steps:
$ awk '{print /[0-9]/,$0}' file*
0 name, dob, gender, date_of_enrolment
1 John, 11-06-01, M, 20-01-2020
1 Beth, 12-03-01, F, 20-01-2020
1 John, 11-06-01, M, 20-01-2020
1 Mary, 13-01-01, F, 20-02-2020
0 name, dob, gender, date_of_enrolment
1 Mary, 13-01-01, F, 20-02-2020
1 David, 13-02-02, M, 20-03-2020
$ awk '{print /[0-9]/,$0}' file* | sort -k1,1n -uk2
0 name, dob, gender, date_of_enrolment
1 Beth, 12-03-01, F, 20-01-2020
1 David, 13-02-02, M, 20-03-2020
1 John, 11-06-01, M, 20-01-2020
1 Mary, 13-01-01, F, 20-02-2020
$ awk '{print /[0-9]/,$0}' file* | sort -k1,1n -uk2 | cut -d' ' -f2-
name, dob, gender, date_of_enrolment
Beth, 12-03-01, F, 20-01-2020
David, 13-02-02, M, 20-03-2020
John, 11-06-01, M, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
The above is just a minor adjustment to the idiomatic way to sort input but retain the header at the top:
awk '{print FNR>1,$0}' file* | sort -k1,1n -k2 | cut -d' ' -f2-
The only significant difference is using /[0-9]/ instead of FNR>1 to distinguish the header line from the data lines. If /[0-9]/ isn't adequate to match only/all of your non-header lines, then use whatever is, e.g. maybe !/^name/.
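For example, keying on the header text instead of the digits (a sketch; adjust the regex to whatever uniquely identifies your header line):
awk '{print !/^name/,$0}' file* | sort -k1,1n -uk2 | cut -d' ' -f2-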
Upvotes: 0
Reputation: 1027
Another way:
cat file0{1..4}.csv > merge.csv
sort -r merge.csv | uniq > output.csv
Then the output.csv file would look like this:
name, dob, gender, date_of_enrolment
Mary, 13-01-01, F, 20-02-2020
John, 11-06-01, M, 20-01-2020
David, 13-02-02, M, 20-03-2020
Beth, 12-03-01, F, 20-01-2020
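One caveat worth adding: this relies on the header line sorting after every data row, so that the reverse sort puts it first. That happens to hold for this sample data, but a name that collates after "name" (or a different header) would land above it. Also, since sort -u compares whole lines just as uniq does here, the second command can be shortened to:
sort -ru merge.csv > output.csv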
Upvotes: 1
Reputation: 88756
With GNU grep:
cat file*.csv | grep -m 1 '^name'; grep -vh '^name' file*.csv | sort -u
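To capture both parts in one output file, the two commands can be grouped and redirected together (a sketch, writing to the merged_file.csv name from the question; the cat is kept because grep's -m limit applies per input file):
{ cat file*.csv | grep -m 1 '^name'; grep -vh '^name' file*.csv | sort -u; } > merged_file.csv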
Upvotes: 2