Reputation: 111
I have multiple csv files in a directory (e.g. file01.csv, file02.csv, file03.csv, file04.csv), but they are mixed: some contain headers and some do not.
Is there an efficient shell method to merge all the csv files together, with the first row containing the header?
Once merged, I intend to de-duplicate the file to remove duplicate rows.
Here is an example.
file01.csv (has a header):
name, dob, gender, date_of_enrolment
John, 11-06-01, M, 20-01-2020
Beth, 12-03-01, F, 20-01-2020
file02.csv (no header):
John, 11-06-01, M, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
file03.csv (has a header):
name, dob, gender, date_of_enrolment
Mary, 13-01-01, F, 20-02-2020
David, 13-02-02, M, 20-03-2020
Desired output, merged_file.csv:
name, dob, gender, date_of_enrolment
John, 11-06-01, M, 20-01-2020
Beth, 12-03-01, F, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
David, 13-02-02, M, 20-03-2020
Upvotes: 1
Views: 571
Reputation: 140417
If you make sure the first csv file read in does have a header, you can just do this:
shopt -s extglob
awk '!seen[$0]++' file01.csv file!(01).csv
Note that the above assumes you are using bash and that file01.csv has a header. The !seen[$0]++ condition prints each line only the first time it is seen, so duplicate rows and the repeated headers of later files are dropped, while listing file01.csv first keeps its header at the top. Also note that shopt -s extglob has to run before bash parses the file!(01).csv glob, which is why it is on its own line.
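Assuming the directory contains just the three sample files from the question, the run should look like this:
$ shopt -s extglob
$ awk '!seen[$0]++' file01.csv file!(01).csv
name, dob, gender, date_of_enrolment
John, 11-06-01, M, 20-01-2020
Beth, 12-03-01, F, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
David, 13-02-02, M, 20-03-2020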
If you don't want to worry about file order, you can do this instead. Note that it keeps two copies of each unique line in memory (one as a key in the seen array and one appended to s), which is the price of preserving input order while holding the header back until the end:
$ awk '
    !/name,/ && !seen[$0]++ {
        s = s sep $0
        sep = ORS
    }
    /name,/ { h = $0 }
    END { print h ORS s }
' file*.csv
name, dob, gender, date_of_enrolment
John, 11-06-01, M, 20-01-2020
Beth, 12-03-01, F, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
David, 13-02-02, M, 20-03-2020
Upvotes: 2
Reputation: 203985
$ awk '{print /[0-9]/,$0}' file* | sort -k1,1n -uk2 | cut -d' ' -f2-
name, dob, gender, date_of_enrolment
Beth, 12-03-01, F, 20-01-2020
David, 13-02-02, M, 20-03-2020
John, 11-06-01, M, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
The above prepends a column indicating the type of each line (0 for the header, 1 for everything else), sorts numerically on that column first to pull the header to the top, sorts unique on the original lines, and then removes the extra column added in the first step. Since only sort has to handle all of the input at once, and it is designed to use paging etc. to cope with large files, this should work even if your input files are massive, and it will be fast to execute.
Here are the steps:
$ awk '{print /[0-9]/,$0}' file*
0 name, dob, gender, date_of_enrolment
1 John, 11-06-01, M, 20-01-2020
1 Beth, 12-03-01, F, 20-01-2020
1 John, 11-06-01, M, 20-01-2020
1 Mary, 13-01-01, F, 20-02-2020
0 name, dob, gender, date_of_enrolment
1 Mary, 13-01-01, F, 20-02-2020
1 David, 13-02-02, M, 20-03-2020
$ awk '{print /[0-9]/,$0}' file* | sort -k1,1n -uk2
0 name, dob, gender, date_of_enrolment
1 Beth, 12-03-01, F, 20-01-2020
1 David, 13-02-02, M, 20-03-2020
1 John, 11-06-01, M, 20-01-2020
1 Mary, 13-01-01, F, 20-02-2020
$ awk '{print /[0-9]/,$0}' file* | sort -k1,1n -uk2 | cut -d' ' -f2-
name, dob, gender, date_of_enrolment
Beth, 12-03-01, F, 20-01-2020
David, 13-02-02, M, 20-03-2020
John, 11-06-01, M, 20-01-2020
Mary, 13-01-01, F, 20-02-2020
The above is just a minor adjustment to the idiomatic way to sort input but retain the header at the top:
awk '{print FNR>1,$0}' file* | sort -k1,1n -k2 | cut -d' ' -f2-
The only significant difference is using /[0-9]/ instead of FNR>1 to distinguish the header line from the data lines. If /[0-9]/ isn't adequate to match only/all of your non-header lines, then use whatever is, e.g. maybe !/^name/.
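For example, keying on the header text instead of the digits (a sketch; adjust the regex to whatever uniquely identifies your header line):
awk '{print !/^name/,$0}' file* | sort -k1,1n -uk2 | cut -d' ' -f2-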
Upvotes: 0
Reputation: 1027
Another way:
cat file0{1..4}.csv > merge.csv
sort -r merge.csv | uniq > output.csv
Then the output.csv file would look like this:
name, dob, gender, date_of_enrolment
Mary, 13-01-01, F, 20-02-2020
John, 11-06-01, M, 20-01-2020
David, 13-02-02, M, 20-03-2020
Beth, 12-03-01, F, 20-01-2020
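One caveat worth adding: this relies on the header line sorting after every data row, so that the reverse sort puts it first. That happens to hold for this sample data, but a name that collates after "name" (or a different header) would land above it. Also, since sort -u compares whole lines just as uniq does here, the second command can be shortened to:
sort -ru merge.csv > output.csv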
Upvotes: 1
Reputation: 88756
With GNU grep:
cat file*.csv | grep -m 1 '^name'; grep -vh '^name' file*.csv | sort -u
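To capture both parts in one output file, the two commands can be grouped and redirected together (a sketch, writing to the merged_file.csv name from the question; the cat is kept because grep's -m limit applies per input file):
{ cat file*.csv | grep -m 1 '^name'; grep -vh '^name' file*.csv | sort -u; } > merged_file.csv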
Upvotes: 2