Combining multiple csv files in bash, splitting the lines into different rows

Question

I've gotten >100 csv files, each containing >1000 measurements structured like the following two example files

MR44825_radiomics_MCA.csv

Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565

MR47987_radiomics_MCA.csv

Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026

The label is always some string of varying length, the distincter to the measurement is always the first :. Each measurement contains identical labels. The measurements themselves may contain , but then the related values are enclosed by {}.

Now I want to merge these files, preferably using bash. The output csv should be structured like the following:

Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026

markp-fuso · Accepted Answer

Assumptions:

OP has a reason for using a non-JSON format (and OP is 'ok' with having commas (,) as both delimiter and data)
all source files have the same number of lines
there are no blank lines in any of the source files
all source files have the same labels preceding the first :
all source files have their labels in the same order
the number, and spelling, of labels is not known up front (ie, we'll need to dynamically parse, store and print the labels)

One awk idea:

NOTE: a bit lengthy due to need to dynamically process labels

awk '
BEGIN  { split("",hdr)                        # declare hdr as an array
         split("",data)                       # declare data as an array
         ndx=1                                # init array index
       }

function print_row() {                        # function to print a row

pfx=""                                        # first column will have a NULL prefix

if ( length(hdr) > 0  )                       # print the header row?
   { for ( i in hdr )
       { printf "%s%s", pfx, hdr[i]
         pfx=","                              # 2nd-nth columns will have a "," prefix
       }
     printf "
"
     split("",hdr)                            # clear hdr[] array so we do not print it again
   }

pfx=""                                        # reset prefix for printing data row

if ( length(data) > 0 )                       # print a data row?
   { for ( i in data )
         { printf "%s%s", pfx, data[i]
           pfx=","                            # 2nd-nth columns will have a "," prefix
         }
     printf "
"
     split("",data)                           # clear the data[] array for the next file
     ndx=1                                    # reset our array index for the next file
   }
}

FNR==1 { print_row() }                        # if this is a new file then print contents of last file

       { if ( FNR==NR )                       # if this is the first file then make sure to populate the hdr[] array
            hdr[ndx]=gensub(/:$/,"","g",$1)   # strip trailing ":" from field #1; store in hdr[] array
         $1=""                                # clear field #1
         data[ndx]=gensub(/^ /,"","g",$0)     # strip leading " " from the line; store in data[] array
         ndx++                                # increment array index
         next
       }

END    { print_row() }                        # flush last set of data[] to stdout

' MR*MCA.csv

When run against the 2x sample data files this generates:

Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026

Combining multiple csv files in bash, splitting the lines into different rows

Answers (2)

Related Questions