Reputation: 23
Could you please help me find the bash command that will join/merge the following csv files "template.csv + file1.csv + file2.csv + file3.csv + ... + fileX.csv" into "output.csv"?
For each line in template.csv, concatenate the associated values (if they exist) listed in the fileX.csv files, as below:
template.csv:
header
1
2
3
4
5
6
7
8
9
file1.csv:
header,value1
2,value12
3,value13
7,value17
8,value18
9,value19
file2.csv:
header,value2
1,value21
2,value22
3,value23
4,value24
file3.csv:
header,value3
2,value32
4,value34
6,value36
7,value37
8,value38
output.csv:
header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,
My template file contains 35137 lines.
I already developed a bash script that does this merge (based on "do while", etc.), but the performance is not good at all: it takes too long to produce output.csv. I'm sure it is possible to do the same using join, awk, etc., but I don't see how.
IMPORTANT UPDATE
The first column of my real files contains a datetime and not a simple number, so the script must take into account the space between the date and the time. Sorry for the update!
The script should now be designed with the csv files below as an example:
template.csv:
header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00
file1.csv:
header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19
file2.csv:
header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24
file3.csv:
header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38
output.csv:
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,
Upvotes: 2
Views: 8169
Reputation: 203229
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR { key[++numRows] = $1 }    # first file (template): remember the keys in order
{ fld[$1,ARGIND] = $NF }             # every file: store the last field by key and file index
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", fld[key[rowNr],colNr], (colNr<ARGIND ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk template.csv file1.csv file2.csv file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,
The above uses GNU awk for ARGIND; with other awks just add a line before the other rules that says FNR==1 { ++ARGIND }.
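For awks without ARGIND, the variant described above can be sketched as follows. This is only an illustration under assumptions: fileNr is a hypothetical counter name standing in for ARGIND, the sample files are the ones from the updated question, and the script is meant to be run in an empty scratch directory because it writes those files there.

```shell
#!/bin/sh
# Sketch only: run in an empty scratch directory, because it overwrites
# template.csv, file1.csv, file2.csv, file3.csv and output.csv there.
cat > template.csv <<'EOF'
header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00
EOF
cat > file1.csv <<'EOF'
header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19
EOF
cat > file2.csv <<'EOF'
header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24
EOF
cat > file3.csv <<'EOF'
header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38
EOF

# fileNr is a hypothetical counter standing in for GNU awk ARGIND.
awk '
BEGIN { FS = OFS = "," }
FNR == 1 { ++fileNr }                  # a new input file starts here
fileNr == 1 { key[++numRows] = $1 }    # template: remember the keys in order
{ fld[$1, fileNr] = $NF }              # store the last field per key and file
END {
    for (rowNr = 1; rowNr <= numRows; rowNr++)
        for (colNr = 1; colNr <= fileNr; colNr++)
            printf "%s%s", fld[key[rowNr], colNr], (colNr < fileNr ? OFS : ORS)
}
' template.csv file1.csv file2.csv file3.csv > output.csv

cat output.csv
```

One caveat of counting files with FNR==1 is that an empty input file never triggers the rule, so it would not get a column of its own.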
Upvotes: 3
Reputation: 44023
This should work; it needs GNU awk 4.0 or later for ENDFILE and arrays of arrays (for explanation read the comments):
#!/bin/sh
awk -F, -v file=0 '
    FNR == 1 {                       # first line in the file
        if(file == 0) {              # if in first file (template.csv):
            header = $1              # init header
        } else {
            header = header "," $2   # else append field name
        }
        next                         # forward to next line.
    }
    file == 0 {                      # if in first file:
        key[FNR] = $1                # remember key
        next                         # next line.
    }
    {
        field[$1][file] = $2         # otherwise: remember field
    }
    ENDFILE {                        # at the end of a file:
        file = file + 1              # increase counter
    }
    END {                            # in the end, assemble and
        print header                 # print lines.
        n = asort(key)               # sort the keys; asort reindexes them 1..n
        for(k = 1; k <= n; ++k) {    # walk by index: "for (k in key)" has no guaranteed order
            line = ""
            for(i = 1; i < file; ++i) {
                line = line "," field[key[k]][i]
            }
            print key[k] line
        }
    }
' template.csv file1.csv file2.csv file3.csv
Upvotes: 1
Reputation: 284
You could use multiple calls to join:
join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv
Or, more clearly:
alias myjoin='join -t , -a 1 -o auto'
myjoin template.csv file1.csv | myjoin - file2.csv | myjoin - file3.csv
Explanation:
-t , specifies the field separator (,)
-a 1 instructs to print unpairable lines coming from the first file (an assumption is made that the template file contains all possible keys)
-o auto controls formatting and is necessary to print the empty fields
Proof:
$ join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,
Note:
For this to work, the files MUST be sorted on the join fields (the header column in your case). You can use the sort command if this is not the case.
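As a rough sketch of how the sort step and the chained joins could be combined for any number of files: the script below sorts each file's body while leaving its header line first, then folds the files into the result one by one. It assumes GNU join (both --header and -o auto are GNU extensions); sort_body, merged.csv, sorted.tmp and merged.tmp are hypothetical names, and the script should be run in an empty scratch directory because it writes the sample files from the question there.

```shell
#!/bin/sh
# Sketch only: run in an empty scratch directory, because it overwrites the
# sample files and writes merged.csv, sorted.tmp and merged.tmp there.
cat > template.csv <<'EOF'
header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00
EOF
cat > file1.csv <<'EOF'
header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19
EOF
cat > file2.csv <<'EOF'
header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24
EOF
cat > file3.csv <<'EOF'
header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38
EOF

# Use one collation order for both sort and join.
LC_ALL=C
export LC_ALL

# sort_body (hypothetical helper): print the header line unchanged,
# then the remaining lines sorted on the first comma-separated field.
sort_body() {
    head -n 1 "$1"
    tail -n +2 "$1" | sort -t , -k 1,1
}

sort_body template.csv > merged.csv
for f in file1.csv file2.csv file3.csv; do
    sort_body "$f" > sorted.tmp
    # --header joins the first lines without checking their sort order.
    join --header -t , -a 1 -o auto merged.csv sorted.tmp > merged.tmp
    mv merged.tmp merged.csv
done
rm -f sorted.tmp

cat merged.csv
```

The explicit file list in the loop could probably be replaced by a glob such as file*.csv, as long as the intermediate file names do not match it.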
Upvotes: 1
Reputation: 674
I would go with this. It is surely not the fastest solution, but for your data it returns the correct result and the code is short:
#!/bin/bash
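# Read the template line by line so that keys containing spaces stay intact.
while IFS= read -r line; do
    TMP=$line
    for file in file1.csv file2.csv file3.csv; do
        # Look up the value for this key (empty if the key is absent).
        RESULT=$(grep "^$line," "$file" | cut -d',' -f2)
        TMP="$TMP,$RESULT"
    done
    echo "$TMP"
done < template.csv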
output:
header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,
EDIT:
my code was missing a comma (,), so for longer ids it did not work properly
EDIT 2:
Well, it is not just "not the fastest solution"; it is a really slow one.
Upvotes: 0