Reputation: 157
I've received over 100 CSV files, each containing over 1000 measurements, structured like the following two example files:
MR44825_radiomics_MCA.csv
Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565
MR47987_radiomics_MCA.csv
Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026
The label is always a string of varying length; the separator between label and measurement is always the first ':'. Each file contains the identical set of labels. The measurements themselves may contain ',', but then the related values are enclosed in '{}'.
Now I want to merge these files, preferably using bash. The output CSV should be structured like the following:
Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026
Upvotes: 1
Views: 140
Reputation: 189327
CSV doesn't really make sense for the data you presented, and the pseudo-CSV you say you want as output makes little sense and will be hard to process further. Perhaps converting each input file to JSON would make more sense, and allow for processing using standard tools.
awk -F ': ' 'FNR==1 { name=$2 }
FNR==2 { j = substr($0, length($1)+3); gsub(/\047/, "\042", j) }
FNR==3 { sub(/^{/, "{\042name\042: \042" name "\042,", j);
sub(/}$/, ",\042mean\042: " $2 "}", j);
print j }' *.csv >output.jsonl
The output should be something like
{"name":"MR20584_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean": -917.2822725885565}
{"name":"MR30211_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean":-1024.287275914652}
This format is JSON lines, i.e. each line is valid JSON, but the file itself isn't proper JSON.
Demo: https://ideone.com/ue0Crd
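Since many downstream tools expect a single JSON document, the JSON Lines file can be wrapped into a proper JSON array afterwards. A minimal sketch, assuming python3 is available (`jq -s '.' output.jsonl` does the same if you have jq); the two records here are abbreviated stand-ins for the full output above:

```shell
# stand-in JSON Lines file (abbreviated records, same shape as the output above)
printf '%s\n' \
  '{"name": "MR44825_head.nii.gz", "mean": -917.2822725885565}' \
  '{"name": "MR47987_head.nii.gz", "mean": -442.31589128260026}' > output.jsonl

# read one JSON object per line, emit a single JSON array
python3 -c '
import json

records = [json.loads(line) for line in open("output.jsonl")]
json.dump(records, open("output.json", "w"), indent=2)
'
```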
Of course, if you can fix the tool which generated this useless format in the first place, even better.
Upvotes: 0
Reputation: 34134
Assumptions:
- all files contain the same set of labels, in the same order
- the comma must be treated as both delimiter and data (values inside '{}' contain commas)
One awk idea:
NOTE: a bit lengthy due to the need to dynamically process labels; this requires GNU awk (for gensub() and length() on arrays)
awk '
BEGIN { split("",hdr) # declare hdr as an array
split("",data) # declare data as an array
ndx=1 # init array index
}
function print_row() { # function to print a row
pfx="" # first column will have a NULL prefix
if ( length(hdr) > 0 ) # print the header row?
{ for ( i=1; i in hdr; i++ ) # numeric loop keeps columns in order ("for (i in hdr)" order is unspecified)
{ printf "%s%s", pfx, hdr[i]
pfx="," # 2nd-nth columns will have a "," prefix
}
printf "\n"
split("",hdr) # clear hdr[] array so we do not print it again
}
pfx="" # reset prefix for printing data row
if ( length(data) > 0 ) # print a data row?
{ for ( i=1; i in data; i++ ) # numeric loop keeps columns in order
{ printf "%s%s", pfx, data[i]
pfx="," # 2nd-nth columns will have a "," prefix
}
printf "\n"
split("",data) # clear the data[] array for the next file
ndx=1 # reset our array index for the next file
}
}
FNR==1 { print_row() } # if this is a new file then print contents of last file
{ if ( FNR==NR ) # if this is the first file then make sure to populate the hdr[] array
hdr[ndx]=gensub(/:$/,"","g",$1) # strip trailing ":" from field #1; store in hdr[] array
$1="" # clear field #1
data[ndx]=gensub(/^ /,"","g",$0) # strip leading " " from the line; store in data[] array
ndx++ # increment array index
next
}
END { print_row() } # flush last set of data[] to stdout
' MR*MCA.csv
When run against the two sample data files this generates:
Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026
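If GNU awk isn't available, the same idea also fits POSIX awk using sub() instead of gensub(). A self-contained sketch that recreates the two sample files from the question and merges them; it assumes every file carries the same labels in the same order, and builds each row as a string instead of an array:

```shell
# recreate the two sample input files from the question
cat > MR44825_radiomics_MCA.csv <<'EOF'
Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565
EOF
cat > MR47987_radiomics_MCA.csv <<'EOF'
Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026
EOF

awk '
FNR == 1 && NR > 1 {                             # a new file starts: flush the previous one
    if (!hdr_done) { print hdr; hdr_done = 1 }   # header printed once, after the first file
    print row
}
{
    label = $0; sub(/:.*/, "", label)            # everything before the first ":"
    value = $0; sub(/^[^:]*: /, "", value)       # everything after the first ": "
    if (NR == FNR) hdr = (FNR == 1 ? label : hdr "," label)  # labels from the first file only
    row = (FNR == 1 ? value : row "," value)
}
END {
    if (!hdr_done) print hdr                     # single-file edge case
    print row                                    # flush the last file
}
' MR*_radiomics_MCA.csv > merged.csv
```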
Upvotes: 1