Reputation: 13
I am working with a .csv file in bash and I need to SUM the last value of each row depending on the previous fields. That is, I need to group in Bash by the first three columns.
Example of input file:
Barcelona, Female, suspect, 2
Barcelona, Female, positive, 3
Barcelona, Female, positive, 2
Barcelona, Male, positive, 1
Barcelona, Female, suspect, 5
Madrid, Male, positive, 3
Madrid, Male, positive, 1
Barcelona, Male, positive, 4
Madrid, Female, suspect, 2
Example of output file:
Barcelona, Female, suspect, 7
Barcelona, Female, positive, 5
Barcelona, Male, positive, 5
Barcelona, Female, suspect, 5
Madrid, Male, positive, 4
Madrid, Female, suspect, 2
Upvotes: 1
Views: 2641
Reputation: 1027
I made this script:
#!/bin/bash
# Barcelona, Female, Suspect
bfs() {
BFS_FILTER=$(egrep -i "[Bb]arcelona, [Ff]emale, [Ss]uspect" data.csv | awk '{ print $4 }')
BFS_CASES=$(for ITEM in ${BFS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
BFS_SUM=$(if [ $(echo $BFS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BFS_CASES |wc -w) -eq 2 ]; then echo $BFS_CASES |awk '{ print $1 }'; else expr ${BFS_CASES}; fi; fi )
echo "Barcelona, Female, suspect, $BFS_SUM"
}
# Barcelona, Male, Suspect
bms() {
BMS_FILTER=$(egrep -i "[Bb]arcelona, [Mm]ale, [Ss]uspect" data.csv | awk '{ print $4 }')
BMS_CASES=$(for ITEM in ${BMS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
BMS_SUM=$(if [ $(echo $BMS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BMS_CASES |wc -w) -eq 2 ]; then echo $BMS_CASES |awk '{ print $1 }'; else expr ${BMS_CASES}; fi; fi )
echo "Barcelona, Male, suspect, $BMS_SUM"
}
# Barcelona, Female, Positive
bfp() {
BFP_FILTER=$(egrep -i "[Bb]arcelona, [Ff]emale, [Pp]ositive" data.csv | awk '{ print $4 }')
BFP_CASES=$(for ITEM in ${BFP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
BFP_SUM=$(if [ $(echo $BFP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BFP_CASES |wc -w) -eq 2 ]; then echo $BFP_CASES |awk '{ print $1 }'; else expr ${BFP_CASES}; fi; fi )
echo "Barcelona, Female, positive, $BFP_SUM"
}
# Barcelona, Male, Positive
bmp() {
BMP_FILTER=$(grep -i "[Bb]arcelona, [Mm]ale, [Pp]ositive" data.csv | awk '{ print $4 }')
BMP_CASES=$(for ITEM in ${BMP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
BMP_SUM=$(if [ $(echo $BMP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BMP_CASES |wc -w) -eq 2 ]; then echo $BMP_CASES |awk '{ print $1 }'; else expr ${BMP_CASES}; fi; fi )
echo "Barcelona, Male, positive, $BMP_SUM"
}
# Madrid, Female, Suspect
mfs() {
MFS_FILTER=$(egrep -i "[Mm]adrid, [Ff]emale, [Ss]uspect" data.csv | awk '{ print $4 }')
MFS_CASES=$(for ITEM in ${MFS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
MFS_SUM=$(if [ $(echo $MFS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MFS_CASES |wc -w) -eq 2 ]; then echo $MFS_CASES |awk '{ print $1 }'; else expr ${MFS_CASES}; fi; fi )
echo "Madrid, Female, suspect, $MFS_SUM"
}
# Madrid, Male, Suspect
mms() {
MMS_FILTER=$(egrep -i "[Mm]adrid, [Mm]ale, [Ss]uspect" data.csv | awk '{ print $4 }')
MMS_CASES=$(for ITEM in ${MMS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
MMS_SUM=$(if [ $(echo $MMS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MMS_CASES |wc -w) -eq 2 ]; then echo $MMS_CASES |awk '{ print $1 }'; else expr ${MMS_CASES}; fi; fi )
echo "Madrid, Male, suspect, $MMS_SUM"
}
# Madrid, Female, Positive
mfp() {
MFP_FILTER=$(egrep -i "[Mm]adrid, [Ff]emale, [Pp]ositive" data.csv | awk '{ print $4 }')
MFP_CASES=$(for ITEM in ${MFP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
MFP_SUM=$(if [ $(echo $MFP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MFP_CASES |wc -w) -eq 2 ]; then echo $MFP_CASES |awk '{ print $1 }'; else expr ${MFP_CASES}; fi; fi )
echo "Madrid, Female, positive, $MFP_SUM"
}
# Madrid, Male, Positive
mmp() {
MMP_FILTER=$(grep -i "[Mm]adrid, [Mm]ale, [Pp]ositive" data.csv | awk '{ print $4 }')
MMP_CASES=$(for ITEM in ${MMP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
MMP_SUM=$(if [ $(echo $MMP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MMP_CASES |wc -w) -eq 2 ]; then echo $MMP_CASES |awk '{ print $1 }'; else expr ${MMP_CASES}; fi; fi )
echo "Madrid, Male, positive, $MMP_SUM"
}
bfs
bms
bfp
bmp
mfs
mms
mfp
mmp
It is based on data extracted from a file I called data.csv
, so you will have to replace every data.csv
coincidence on it with your file's name before running it to make it work. For example, if you save it as script.sh
and your input file is called yourfile.csv
, you could try:
sed -i "s|data.csv|yourfile.csv|g" script.sh
The reason to separate every output into different functions is that if you want to filter out some of them at any time you could do this with no extra effort, just comment the function's name and you are done. Also, you could add extra cases by simply copy-pasting the schema and replacing the new variable names properly.
Now the explanation:
The row of number cases filter is made via:
egrep -i "$CITY, $GENRE, $STATUS" data.csv | awk '{ print $4 }'
Then, the sum of every row found works like this:
for ITEM in ${SOME_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |" |sed "s/\+ $//g"
Where it outputs expr $NUM + $NUM ...
in case there are more than 2 rows. But if there are not (which I detected there are cases that only shows positive number cases in only one row), I filter them with this method:
Lastly, it also prints a line when there are no detected cases, just via echo 0
. An example output:
Barcelona, Female, suspect, 7
Barcelona, Male, suspect, 0
Barcelona, Female, positive, 5
Barcelona, Male, positive, 5
Madrid, Female, suspect, 2
Madrid, Male, suspect, 0
Madrid, Female, positive, 0
Madrid, Male, positive, 4
Hope you find it useful.
Upvotes: 0
Reputation: 5688
Using Miller (https://github.com/johnkerl/miller) and running
mlr --csv -N stats1 -a sum -f 4 -g 1,2,3 input.csv
you have
Barcelona, Female, suspect,7
Barcelona, Female, positive,5
Barcelona, Male, positive,5
Madrid, Male, positive,4
Madrid, Female, suspect,2
Upvotes: 1
Reputation: 52409
GNU datamash is designed just for this sort of task:
datamash -t, -sg1,2,3 sum 4 < input.csv
Or with awk
:
awk -F, '{ groups[$1 "," $2 "," $3] += $4}
END { PROCINFO["sorted_in"] = "@ind_str_asc" # Sort output in GNU awk
for (g in groups) print g "," groups[g] }' input.csv
Upvotes: 4