Bhaskar Mishra

Reputation: 3542

group by and filter the result

I have a pipe-separated file, and I need to group by one field and get the count of its occurrences.

My input file looks like:

96472|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
96472|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
96472|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12

This is how I am doing it:

cut -d'|' -f1 somefile.txt | cut -d'-' -f1 | sort | uniq -c

The output is:

  3 96472
 10 214126

Essentially I want to count the occurrences of a field, just like a GROUP BY clause in SQL. In my example, column 1 has repeated values with counts of 3 and 10.

I am sure there is a better way to do it. I also want to keep only the records that have fewer than 10 occurrences, which I do with:

cut -d'|' -f1 somefile.txt | cut -d'-' -f1 | sort | uniq -c | grep -v 10

Is there a good way to achieve both?
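As an editorial aside: the grep -v 10 filter above matches the substring 10 anywhere on the line, so it would also drop a count of 100 or an ID that happens to contain 10. A quick demonstration with made-up counts:

```shell
# Demonstration (editor's note, made-up data): 'grep -v 10' matches the
# substring "10", so the 100-count line is dropped along with the 10-count line.
printf '     10 214126\n      3 96472\n    100 55555\n' | grep -v 10
```

Only the "3 96472" line survives; the 100-count line is lost even though 100 is not 10.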

Upvotes: 2

Views: 1211

Answers (3)

Lino

Reputation: 6160

Just an answer that extends your command:

cut -d'|' -f1 somefile.txt | cut -d'-' -f2 | sort | uniq -c | awk '{ if ($1 < 10) print $0  }'
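A small side note (not from the original answer): an awk pattern with no action prints the matching line by default, so the trailing filter can be shortened to awk '$1 < 10'. A sketch against a hypothetical miniature of somefile.txt:

```shell
# Hypothetical miniature of somefile.txt (two IDs, occurring 2 and 1 times).
printf '96472|Text1\n96472|Text1\n214126|Text1\n' > somefile.txt

# An awk pattern with no action prints matching lines,
# so '$1 < 10' alone is equivalent to '{ if ($1 < 10) print $0 }'.
cut -d'|' -f1 somefile.txt | sort | uniq -c | awk '$1 < 10'
```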

Upvotes: 0

sameerkn

Reputation: 2259

Suppose your data is in the file txt.

sort -t '|' -k 1 txt | uniq -c | awk -F"|" '{print $1}' | awk '{if($1 < 10) print $0}'

The sort command will:

  • split each line into fields on the '|' character (-t '|') and
  • sort using the first field (-k 1) as the start of the sort key
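To make the stages concrete, here is a sketch of what each step of the pipeline emits for a small made-up txt file:

```shell
# Hypothetical sample 'txt' in the question's pipe-delimited format.
printf '96472|Text1|6|A\n96472|Text1|6|A\n96472|Text1|6|A\n214126|Text1|6|A\n' > txt

# sort -t '|' -k 1 txt            -> identical lines grouped together
# | uniq -c                       -> '   3 96472|Text1|6|A' (count prefixed to the line)
# | awk -F"|" '{print $1}'        -> '   3 96472' (everything before the first pipe)
# | awk '{if($1 < 10) print $0}'  -> keeps rows whose count is below 10
sort -t '|' -k 1 txt | uniq -c | awk -F"|" '{print $1}' | awk '{if($1 < 10) print $0}'
```

Note that uniq -c collapses adjacent identical whole lines, which works here because duplicate records are identical in every field.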

Upvotes: 0

Inian

Reputation: 85770

Simple awk logic is sufficient for this, with no need to chain other utilities. For the input file in question, the output is as follows:

awk -F"|" '{count[$1]++}END{for (i in count) print count[i],i}' file
3 96472
10 214126

The idea is that count[$1]++ increments the count of occurrences of $1 as the file is read; once the file is processed, the END clause prints the total count of each unique value of $1.
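One caveat worth adding (an editorial aside, not part of the answer): POSIX leaves the traversal order of for (i in count) unspecified, so when a stable order matters the output can be piped through sort. A sketch on a reduced, hypothetical input:

```shell
# Hypothetical reduced input with the question's first column.
printf '96472|x\n96472|x\n214126|x\n' > file

# Same counting logic as the answer; sort -k2,2n then orders rows by the ID
# numerically, since awk's for-in traversal order is implementation-defined.
awk -F"|" '{count[$1]++} END {for (i in count) print count[i], i}' file | sort -k2,2n
```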

Another filter, to list only those counts less than 10:

awk -F"|" '{count[$1]++}END{for (i in count) if (count[i] < 10){print count[i],i}}' file
3 96472

Upvotes: 6
