Martin Muruetatorre
Martin Muruetatorre

Reputation: 13

GROUP BY CSV columns in bash

I am working with a .csv file in bash and I need to SUM the last value of each row depending on the previous fields. That is, I need to group in Bash by the first three columns.

Example of input file:

Barcelona, Female, suspect, 2
Barcelona, Female, positive, 3
Barcelona, Female, positive, 2
Barcelona, Male, positive, 1
Barcelona, Female, suspect, 5
Madrid, Male, positive, 3
Madrid, Male, positive, 1
Barcelona, Male, positive, 4
Madrid, Female, suspect, 2

Example of output file:

Barcelona, Female, suspect, 7
Barcelona, Female, positive, 5
Barcelona, Male, positive, 5
Barcelona, Female, suspect, 5
Madrid, Male, positive, 4
Madrid, Female, suspect, 2


Upvotes: 1

Views: 2641

Answers (3)

mtnezm
mtnezm

Reputation: 1027

I made this script:

#!/bin/bash

# Barcelona, Female, Suspect
bfs() {
        BFS_FILTER=$(egrep -i "[Bb]arcelona, [Ff]emale, [Ss]uspect" data.csv | awk '{ print $4 }')
        BFS_CASES=$(for ITEM in ${BFS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
        BFS_SUM=$(if [ $(echo $BFS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BFS_CASES |wc -w) -eq 2 ]; then echo $BFS_CASES |awk '{ print $1 }'; else expr ${BFS_CASES}; fi; fi )

        echo "Barcelona, Female, suspect, $BFS_SUM"
}

# Barcelona, Male, Suspect
bms() {
        BMS_FILTER=$(egrep -i "[Bb]arcelona, [Mm]ale, [Ss]uspect" data.csv | awk '{ print $4 }')
        BMS_CASES=$(for ITEM in ${BMS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
        BMS_SUM=$(if [ $(echo $BMS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BMS_CASES |wc -w) -eq 2 ]; then echo $BMS_CASES |awk '{ print $1 }'; else expr ${BMS_CASES}; fi; fi )

        echo "Barcelona, Male, suspect, $BMS_SUM"
}

# Barcelona, Female, Positive
bfp() {
        BFP_FILTER=$(egrep -i "[Bb]arcelona, [Ff]emale, [Pp]ositive" data.csv | awk '{ print $4 }')
        BFP_CASES=$(for ITEM in ${BFP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
        BFP_SUM=$(if [ $(echo $BFP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BFP_CASES |wc -w) -eq 2 ]; then echo $BFP_CASES |awk '{ print $1 }'; else expr ${BFP_CASES}; fi; fi )

        echo "Barcelona, Female, positive, $BFP_SUM"
}

# Barcelona, Male, Positive
bmp() {
        BMP_FILTER=$(grep -i "[Bb]arcelona, [Mm]ale, [Pp]ositive" data.csv | awk '{ print $4 }')
        BMP_CASES=$(for ITEM in ${BMP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
        BMP_SUM=$(if [ $(echo $BMP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BMP_CASES |wc -w) -eq 2 ]; then echo $BMP_CASES |awk '{ print $1 }'; else expr ${BMP_CASES}; fi; fi )

        echo "Barcelona, Male, positive, $BMP_SUM"
}

# Madrid, Female, Suspect
mfs() {
        MFS_FILTER=$(egrep -i "[Mm]adrid, [Ff]emale, [Ss]uspect" data.csv | awk '{ print $4 }')
        MFS_CASES=$(for ITEM in ${MFS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
        MFS_SUM=$(if [ $(echo $MFS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MFS_CASES |wc -w) -eq 2 ]; then echo $MFS_CASES |awk '{ print $1 }'; else expr ${MFS_CASES}; fi; fi )

        echo "Madrid, Female, suspect, $MFS_SUM"
}

# Madrid, Male, Suspect
mms() {
        MMS_FILTER=$(egrep -i "[Mm]adrid, [Mm]ale, [Ss]uspect" data.csv | awk '{ print $4 }')
        MMS_CASES=$(for ITEM in ${MMS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
        MMS_SUM=$(if [ $(echo $MMS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MMS_CASES |wc -w) -eq 2 ]; then echo $MMS_CASES |awk '{ print $1 }'; else expr ${MMS_CASES}; fi; fi )

        echo "Madrid, Male, suspect, $MMS_SUM"
}

# Madrid, Female, Positive
mfp() {
        MFP_FILTER=$(egrep -i "[Mm]adrid, [Ff]emale, [Pp]ositive" data.csv | awk '{ print $4 }')
        MFP_CASES=$(for ITEM in ${MFP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
        MFP_SUM=$(if [ $(echo $MFP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MFP_CASES |wc -w) -eq 2 ]; then echo $MFP_CASES |awk '{ print $1 }'; else expr ${MFP_CASES}; fi; fi )

        echo "Madrid, Female, positive, $MFP_SUM"
}

# Madrid, Male, Positive
mmp() {
        MMP_FILTER=$(grep -i "[Mm]adrid, [Mm]ale, [Pp]ositive" data.csv | awk '{ print $4 }')
        MMP_CASES=$(for ITEM in ${MMP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
        MMP_SUM=$(if [ $(echo $MMP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MMP_CASES |wc -w) -eq 2 ]; then echo $MMP_CASES |awk '{ print $1 }'; else expr ${MMP_CASES}; fi; fi )

        echo "Madrid, Male, positive, $MMP_SUM"
}

bfs
bms
bfp
bmp
mfs
mms
mfp
mmp

It is based on data extracted from a file I called data.csv, so you will have to replace every data.csv coincidence on it with your file's name before running it to make it work. For example, if you save it as script.sh and your input file is called yourfile.csv, you could try:

sed -i "s|data.csv|yourfile.csv|g" script.sh

The reason to separate every output into different functions is that if you want to filter out some of them at any time you could do this with no extra effort, just comment the function's name and you are done. Also, you could add extra cases by simply copy-pasting the schema and replacing the new variable names properly.

Now the explanation:

The row of number cases filter is made via:

egrep -i "$CITY, $GENRE, $STATUS" data.csv | awk '{ print $4 }'

Then, the sum of every row found works like this:

for ITEM in ${SOME_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |" |sed "s/\+ $//g"

Where it outputs expr $NUM + $NUM ... in case there are more than 2 rows. But if there are not (which I detected there are cases that only shows positive number cases in only one row), I filter them with this method:

  • If there is only one row with more than one case, just print it
  • If there are multiple rows, sum each of them
  • If there are no cases but the function is called and the rows exist, print 0

Lastly, it also prints a line when there are no detected cases, just via echo 0. An example output:

Barcelona, Female, suspect, 7
Barcelona, Male, suspect, 0
Barcelona, Female, positive, 5
Barcelona, Male, positive, 5
Madrid, Female, suspect, 2
Madrid, Male, suspect, 0
Madrid, Female, positive, 0
Madrid, Male, positive, 4

Hope you find it useful.

Upvotes: 0

aborruso
aborruso

Reputation: 5688

Using Miller (https://github.com/johnkerl/miller) and running

mlr --csv -N stats1 -a sum -f 4 -g 1,2,3 input.csv

you have

Barcelona, Female, suspect,7
Barcelona, Female, positive,5
Barcelona, Male, positive,5
Madrid, Male, positive,4
Madrid, Female, suspect,2

Upvotes: 1

Shawn
Shawn

Reputation: 52409

GNU datamash is designed just for this sort of task:

datamash -t, -sg1,2,3 sum 4 < input.csv

Or with awk:

awk -F, '{ groups[$1 "," $2 "," $3] += $4}
         END { PROCINFO["sorted_in"] = "@ind_str_asc" # Sort output in GNU awk
               for (g in groups) print g "," groups[g] }' input.csv

Upvotes: 4

Related Questions