Merging rows in .csv in order

Question

After analyzing brain scans I ended up with around 1000 .csv files, one per scan. I merged them into a single file, ordered by subject ID and date. My problem is that some subjects had two or more consecutive scans and some had only one. The merged file now looks like this:

ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530          //first scan of A
024_S_0985, 437.50, 204.80, 0.131074          //second scan of A
024_S_0985, 400.75, 198.80, 0.127420          //third scan  of A
024_S_1063, 544.50, 214.34, 0.148939          //first and only scan of B
024_S_1171, 654.75, 240.33, 0.142453          //first scan of C
024_S_1171, 659.50, 242.21, 0.141269          //second scan of C
...

But I want it to look like this:

ID, CC_area, CC_perimeter, CC_circularity, CC_area2, CC_perimeter2, CC_circularity2, CC_area3, CC_perimeter3, CC_circularity3, ..., CC_circularity6
024_S_0985, 407.00, 192.15, 0.138530, 437.50, 204.80, 0.131074, 400.75, 198.80, 0.127420, ... , 
024_S_1063, 544.50, 214.34, 0.148939,,,,,, ..., 
024_S_1171, 654.75, 240.33, 0.142453, 659.50, 242.21, 0.141269,,, ... , 
...

Importantly, the order of the data must not change, and the number of rows per ID is not known in advance (it varies from 1 to 6). So the columns of scan 1 come first, then those of scan 2, and so on. Could you help me with a solution for this using bash? I am not experienced in programming and have lost hope that I can do it myself.

David C. Rankin · Accepted Answer

You can combine the lines with the same filename (or initial index) using a normal while read loop that acts on three conditions: (1) whether it is the first line following the header; (2) whether the current index is equal to the last; and (3) whether the current index differs from the last. There are a number of ways to approach this, but a short bash script could look like the following:

#!/bin/bash

fn="${1:-/dev/stdin}"       ## accept filename or stdin

[ -r "$fn" ] || {           ## validate file is readable
    printf "error: file not found: '%s'\n" "$fn"
    exit 1
}

declare -i cnt=0            ## flag for 1st iteration

while read -r line; do      ## for each line in file

    ## read header, print & continue
    [ "${line//,*/}" = "ID" ] && printf "%s\n" "$line" && continue

    line="${line//  */}"            ## strip //first scan of A....
    idx=${line//,*/}                ## parse file index from line
    line="${line#*, }"              ## strip index

    if [ $cnt -eq 0 ]; then         ## if first line - print
        printf "%s, %s" "$idx" "$line"
        ((cnt++))
    elif [ "$idx" = "$lidx" ]; then ## if indexes equal, append
        printf ", %s" "$line"
    else                            ## else, newline & print
        printf "\n%s, %s" "$idx" "$line"
    fi

    last="$line"            ## save last line
    lidx=$idx               ## save last index

done <"$fn"

printf "\n"
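
The parsing in the loop rests on three parameter expansions, which may be easier to follow in isolation. The following is only a small sketch applied to one sample row from the question; the variable name vals is introduced here for illustration (the script itself reuses line).

```bash
#!/bin/bash

# Sample row in the same shape as the question's data (trailing note included).
line="024_S_0985, 407.00, 192.15, 0.138530          //first scan of A"

line="${line//  */}"     # drop everything from the first run of two spaces onward
idx="${line//,*/}"       # drop everything from the first comma onward, leaving the ID
vals="${line#*, }"       # remove the shortest prefix ending in ", " (the ID field)

printf "idx:  %s\n" "$idx"
printf "vals: %s\n" "$vals"
```

Note that the first expansion assumes the data fields themselves never contain two consecutive spaces; if they did, the row would be cut short at that point.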

Input

$ cat dat/cmbcsv.dat
ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530          //first scan of A
024_S_0985, 437.50, 204.80, 0.131074          //second scan of A
024_S_0985, 400.75, 198.80, 0.127420          //third scan  of A
024_S_1063, 544.50, 214.34, 0.148939          //first and only scan of B
024_S_1171, 654.75, 240.33, 0.142453          //first scan of C
024_S_1171, 659.50, 242.21, 0.141269          //second scan of C

Output

$ bash cmbcsv.sh dat/cmbcsv.dat
ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530, 437.50, 204.80, 0.131074, 400.75, 198.80, 0.127420
024_S_1063, 544.50, 214.34, 0.148939
024_S_1171, 654.75, 240.33, 0.142453, 659.50, 242.21, 0.141269

Note: I didn't know whether you needed all the additional commas or ellipses or if they were just there to show there could be more of the same index (e.g. ,,...,). You can easily add them if need be.
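
If the padding is wanted, one possible way to add it is a second pass over the merged output. The sketch below is not part of the script above. It assumes, based on the question, at most 6 scans per ID and 3 measurements per scan, and it rebuilds the header using the column names shown in the question. The function name pad_rows and the variables max_scans and per_scan are hypothetical.

```bash
#!/bin/bash

# Assumptions taken from the question, not from the script above:
# at most 6 scans per ID and 3 measurements per scan.
max_scans=6
per_scan=3

# pad_rows: read the merged CSV on stdin, write a padded CSV on stdout.
pad_rows() {
    local line commas want have i hdr
    want=$(( 1 + max_scans * per_scan ))    # ID column + all measurement columns

    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue          # skip blank lines

        if [ "${line%%,*}" = "ID" ]; then   # header line: rebuild it in full
            hdr="ID, CC_area, CC_perimeter, CC_circularity"
            for (( i = 2; i <= max_scans; i++ )); do
                hdr+=", CC_area$i, CC_perimeter$i, CC_circularity$i"
            done
            printf "%s\n" "$hdr"
            continue
        fi

        commas="${line//[^,]/}"             # keep only the commas
        have=$(( ${#commas} + 1 ))          # fields = commas + 1

        for (( i = have; i < want; i++ )); do
            line+=","                       # one empty field per missing value
        done
        printf "%s\n" "$line"
    done
}
```

With the function defined in the current shell, the merged output could be piped through it, for example: bash cmbcsv.sh dat/cmbcsv.dat | pad_rows. Every row then has the same number of fields, which some spreadsheet and statistics tools expect.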
