I have multiple files (*.csv) that have a common column ("Common_Title") with corresponding values ("Value#") and I need to merge these into a summary file.
The difficult part is that every value must be carried over into the summary file, but individual files sometimes lack a given "Common_Title" row; in that case a "0000" should be written in the corresponding "Value#" position.
Here are 3 example files, each a bit different:
# File1.csv
Common_Title,Value1
AAAAA,1111
BBBBB,1111
CCCCC,1111
# File2.csv
Common_Title,Value2
AAAAA,2222
BBBBB,2222
DDDDD,2222
EEEEE,2222
# File3.csv
Common_Title,Value3
AAAAA,3333
BBBBB,3333
CCCCC,3333
EEEEE,3333
which I'd like to merge into this summary file
# MergedFiles123.csv
Common_Title,Value1,Value2,Value3
AAAAA,1111,2222,3333
BBBBB,1111,2222,3333
CCCCC,1111,0000,3333
DDDDD,0000,2222,0000
EEEEE,0000,2222,3333
Different approach but same output using GNU awk:
awk 'BEGIN {
FS = OFS = ","
PROCINFO["sorted_in"] = "@val_str_asc"
}
FNR == 1 {
filecount++
numfields[filecount] = NF
if (NR == 1) {
a = split($0, header, FS)
} else {
for (i = 2; i <= NF; i++) {
header[++a] = $i
}
}
}
FNR > 1 {
for (j = 2; j <= NF; j++) {
b[$1][filecount, j] = $j
}
}
END {
for (k = 1; k <= length(header); k++) {
printf "%s%s", header[k], ((k < length(header)) ? OFS : ORS)
}
for (l in b) {
printf "%s", l OFS
for (m = 1; m <= filecount; m++) {
for (n = 2; n <= numfields[m]; n++) {
printf "%s%s",
(b[l][m, n] == "" ? "0000" : b[l][m, n]),
((m + n < filecount + numfields[m]) ? OFS : ORS)
}
}
}
}' File*.csv
Common_Title,Value1,Value2,Value3
AAAAA,1111,2222,3333
BBBBB,1111,2222,3333
CCCCC,1111,0000,3333
DDDDD,0000,2222,0000
EEEEE,0000,2222,3333
This solution can also handle a discordant number of fields between files: it fills missing values with "NA" and processes the header separately so that the rest of the output can be sorted.
With different example data:
head file*
==> file1 <==
ID,Value
A1,10
A2,20
A3,30
A4,40
==> file2 <==
ID,Score,Extra
A2,200,True
A1,100,False
==> file3 <==
ID,Evaluation
A1,Correct
A3,Incorrect
==> file4 <==
ID,Value1,Value2,Value3,Value4
A1,,1,1
A2,3,3,3,3
awk 'BEGIN {
FS = OFS = ","
PROCINFO["sorted_in"] = "@val_str_asc"
}
FNR == 1 {
filecount++
numfields[filecount] = NF
if (NR == 1) {
a = split($0, header, FS)
} else {
for (i = 2; i <= NF; i++) {
header[++a] = $i
}
}
}
FNR > 1 {
for (j = 2; j <= NF; j++) {
b[$1][filecount, j] = $j
}
}
END {
for (k = 1; k <= length(header); k++) {
printf "%s%s", header[k], ((k < length(header)) ? OFS : ORS)
}
for (l in b) {
printf "%s", l OFS
for (m = 1; m <= filecount; m++) {
for (n = 2; n <= numfields[m]; n++) {
printf "%s%s",
(b[l][m, n] == "" ? "NA" : b[l][m, n]),
((m + n < filecount + numfields[m]) ? OFS : ORS)
}
}
}
}' file*
ID,Value,Score,Extra,Evaluation,Value1,Value2,Value3,Value4
A1,10,100,False,Correct,NA,1,1,NA
A2,20,200,True,NA,3,3,3,3
A3,30,NA,NA,Incorrect,NA,NA,NA,NA
A4,40,NA,NA,NA,NA,NA,NA,NA
Using awk:
awk -F, '
!a[$1]++ { keys[++i] = $1 }
!r[FILENAME]++ { ++argind }
{ b[$1, argind] = $2 }
END {
    for (i = 1; i in keys; ++i) {
        k = keys[i]
        t = k
        for (j = 1; j <= argind; ++j)
            t = t FS (length(b[k, j]) ? b[k, j] : "0000")
        print t
    }
}' file1 file2 file3
Output:
Common_Title,Value1,Value2,Value3
AAAAA,1111,2222,3333
BBBBB,1111,2222,3333
CCCCC,1111,0000,3333
DDDDD,0000,2222,0000
EEEEE,0000,2222,3333
Comments:

!a[$1]++ { keys[++i] = $1 }
With the condition !a[$1]++, the block ({ }) is only evaluated when $1 is first encountered. Inside the block, the key ($1) is added to the keys array exactly once.

!r[FILENAME]++ { ++argind }
Increments the counter argind every time a new file is encountered.

{ b[$1, argind] = $2 }
Stores the data with reference to the key and the current file's index argind.

END { for (i = 1; i in keys; ++i) { k = keys[i]; t = k; for (j = 1; j <= argind; ++j) t = t FS (length(b[k, j]) ? b[k, j] : "0000"); print t } }
Runs after all files are processed. It walks through every key in keys in insertion order and prints one line per key, taking each field from b[] by key and file index, with "0000" substituted where no value was stored.
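The !a[$1]++ guard is the standard awk first-occurrence idiom: the postfix increment returns the old count, so the expression is true only the first time a given $1 is seen. A quick standalone check (plain POSIX awk; the sample data is made up):

```shell
# Prints only the first line seen for each distinct first field.
printf 'AAAAA,1\nBBBBB,1\nAAAAA,2\n' | awk -F, '!seen[$1]++'
```

This prints "AAAAA,1" and "BBBBB,1"; the second AAAAA line is suppressed because seen["AAAAA"] is already nonzero.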