Luca

Reputation: 851

Parsing a .csv-like file in bash

I have a file formatted as follows:

string1,string2,string3,...
...

I have to analyze the second column, counting the occurrences of each string, and producing a file formatted as follows:

"number of occurrences of x",x
"number of occurrences of y",y        
...

I managed to write the following script, which works fine:

#!/bin/bash

# count occurrences of each 2nd-column string, emitting "count,string"
> output
regExp='^[[:space:]]*([0-9]+) (.+)$'
while IFS= read -r line
do
    if [[ "$line" =~ $regExp ]]
    then
        # keep the data out of the printf format string
        printf '%s,%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" >> output
    fi
done <<< "$(gawk -F , '!/^$/ {print $2}' "$1" | sort | uniq -c)"

My question is: Is there a better and simpler way to do the job?

In particular, I don't know how to fix this:

gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'

The problem is that string2 can contain whitespace and, if so, the second call to gawk truncates the string. Nor do I know how to print all the fields from 2 to NF while keeping the delimiter, which can occur several times in a row.
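The naive rebuild I can think of would be something like this rough sketch:

gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c |
gawk '{ out = $2; for (i = 3; i <= NF; i++) out = out " " $i; print $1 "," out }'

but awk's default field splitting has already collapsed the runs of spaces and tabs by then, so string2 comes out mangled.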

Thanks very much, goodbye

EDIT:

As requested, here is some sample data:

(It is an exercise, sorry for the made-up data.)

Input:

*,*,*
test,  test  ,test
prova, * , prova
test,test,test
prova,  prova   ,prova
leonardo,da vinci,leonardo
in,o    u   t   ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o    u   t   ,pr
test,  test  ,test
,   tabs    ,
,   tabs    ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
,   tabs    ,

Output:

3, * 
4,*
4,da vinci
2,o u   t   
3,po
1,  prova   
3, spaces 
3,  tabs    
1,test
2,  test  

Upvotes: 5

Views: 741

Answers (3)

Chris Koknat

Reputation: 3451

Here is a Perl one-liner, similar to Filipe's awk solution:

perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv

The output is sorted alphabetically by the second column.
Note that Perl's @F autosplit array starts at $F[0], while awk fields start at $1.
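If you would rather see the most frequent strings first, the keys can be sorted by their counts instead; an untested variant:

perl -F, -lane '$x{$F[1]}++; END{ print "$x{$_},$_" for sort { $x{$b} <=> $x{$a} } keys %x }' input.csv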

Upvotes: 0

Filipe Gonçalves

Reputation: 21223

A one-liner in awk:

awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv

It stores the count for each 2nd column string in the associative array x, and in the end loops through the array and prints the results.

To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:

awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2

The only condition, of course, is that the 2nd column of each line doesn't contain a comma.
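If you are on gawk specifically, as in the question, the external sort can also be skipped by asking gawk to iterate the array in index order; a sketch of that variant:

# gawk-only: make the for-in loop traverse indices in ascending string order
gawk -F, '{ x[$2]++ } END { PROCINFO["sorted_in"] = "@ind_str_asc"; for (i in x) print x[i] "," i }' input.csv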

Upvotes: 5

meuh

Reputation: 12255

You can change your final awk to:

gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'

or use sed for this sort of thing:

sed 's/ *\([0-9]*\) /\1,/'
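Plugged into the pipeline from the question, for example:

gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | sed 's/ *\([0-9]*\) /\1,/'

Only the leading count is rewritten, so any whitespace inside string2 survives untouched.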

Upvotes: 1
