Adrian Mak
Adrian Mak

Reputation: 157

Combining multiple csv files in bash, splitting the lines into different rows

I've gotten >100 csv files, each containing >1000 measurements structured like the following two example files

MR44825_radiomics_MCA.csv

Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565

MR47987_radiomics_MCA.csv

Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026

The label is always some string of varying length, the distincter to the measurement is always the first :. Each measurement contains identical labels. The measurements themselves may contain , but then the related values are enclosed by {}.

Now I want to merge these files, preferably using bash. The output csv should be structured like the following:

Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026

Upvotes: 1

Views: 140

Answers (2)

tripleee
tripleee

Reputation: 189327

CSV doesn't really make sense for the data you presented, and the pseudo-CSV you say you want as output makes little sense and will be hard to process further. Perhaps converting each input file to JSON would make more sense, and allow for processing using standard tools.

awk -F ': ' 'FNR==1 { name=$2 }
    FNR==2 { j = substr($0, length($1)+3); gsub(/\047/, "\042", j) }
    FNR==3 { sub(/^{/, "{\042name\042: \042" name "\042,", j);
        sub(/}$/, ",\042mean\042: " $2 "}", j);
        print j }' *.csv >output.jsonl

The output should be something like

{"name":"MR20584_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean": -917.2822725885565}
{"name":"MR30211_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean":-1024.287275914652}

This format is JSON lines, i.e. each line is valid JSON, but the file itself isn't proper JSON.

Demo: https://ideone.com/ue0Crd

Of course, if you can fix the tool which generated this useless format in the first place, even better.

Upvotes: 0

markp-fuso
markp-fuso

Reputation: 34134

Assumptions:

  • OP has a reason for using a non-JSON format (and OP is 'ok' with having commas (,) as both delimiter and data)
  • all source files have the same number of lines
  • there are no blank lines in any of the source files
  • all source files have the same labels preceding the first :
  • all source files have their labels in the same order
  • the number, and spelling, of labels is not known up front (ie, we'll need to dynamically parse, store and print the labels)

One awk idea:

NOTE: a bit lengthy due to need to dynamically process labels

awk '
BEGIN  { split("",hdr)                        # declare hdr as an array
         split("",data)                       # declare data as an array
         ndx=1                                # init array index
       }

function print_row() {                        # function to print a row

pfx=""                                        # first column will have a NULL prefix

if ( length(hdr) > 0  )                       # print the header row?
   { for ( i in hdr )
       { printf "%s%s", pfx, hdr[i]
         pfx=","                              # 2nd-nth columns will have a "," prefix
       }
     printf "\n"
     split("",hdr)                            # clear hdr[] array so we do not print it again
   }

pfx=""                                        # reset prefix for printing data row

if ( length(data) > 0 )                       # print a data row?
   { for ( i in data )
         { printf "%s%s", pfx, data[i]
           pfx=","                            # 2nd-nth columns will have a "," prefix
         }
     printf "\n"
     split("",data)                           # clear the data[] array for the next file
     ndx=1                                    # reset our array index for the next file
   }
}

FNR==1 { print_row() }                        # if this is a new file then print contents of last file

       { if ( FNR==NR )                       # if this is the first file then make sure to populate the hdr[] array
            hdr[ndx]=gensub(/:$/,"","g",$1)   # strip trailing ":" from field #1; store in hdr[] array
         $1=""                                # clear field #1
         data[ndx]=gensub(/^ /,"","g",$0)     # strip leading " " from the line; store in data[] array
         ndx++                                # increment array index
         next
       }

END    { print_row() }                        # flush last set of data[] to stdout

' MR*MCA.csv

When run against the 2x sample data files this generates:

Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026

Upvotes: 1

Related Questions