Reputation: 3542
I have a pipe-separated file, and I have to group by a field and count its occurrences.
My input file looks like:
96472|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
96472|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
96472|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
214126|Text1|6|A|City|Austin, TX|0123|9899|2017-02-12
This is how I am doing it:
cut -d'|' -f1 somefile.txt | cut -d'-' -f1 | sort | uniq -c
output is
3 96472
10 214126
Essentially I want to count the occurrences of a field, just like a GROUP BY clause in SQL. So in my example, column 1 has values repeated 3 and 10 times.
I am sure there is a better way to do it. And I also want to filter for records which have fewer than 10 occurrences, so I do:
cut -d'|' -f1 somefile.txt | cut -d'-' -f1 | sort | uniq -c | grep -v 10
Is there a good way to achieve both?
Upvotes: 2
Views: 1211
Reputation: 6160
Just an answer that extends your command:
cut -d'|' -f1 somefile.txt | cut -d'-' -f2 | sort | uniq -c | awk '{ if ($1 < 10) print $0 }'
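As a side note, the trailing awk can be shortened: in awk, a pattern with no action prints matching lines by default, so the explicit if/print is optional. A sketch on sample uniq -c output (the values are made up):

```shell
# A bare pattern prints matching lines, so this is equivalent
# to awk '{ if ($1 < 10) print $0 }'.
printf '3 96472\n10 214126\n' | awk '$1 < 10'
# prints: 3 96472
```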
Upvotes: 0
Reputation: 2259
Suppose your data is in a file named txt:
sort -t '|' -k 1 txt | uniq -c | awk -F"|" '{print $1}' | awk '{if($1 < 10) print $0}'
The sort command will split (-t '|') the data on the '|' character and then use the first field (-k 1) as the sort key.
Upvotes: 0
Reputation: 85770
A simple awk script is sufficient for this; there is no need for other utilities. For the input file in question, the output is as follows:
awk -F"|" '{count[$1]++}END{for (i in count) print count[i],i}' file
3 96472
10 214126
The idea is that count[$1]++ increments the occurrence count of $1 as the file is read; once the file is processed, the END block prints the total count for each unique value of $1.
Another filter, to list only those with a count less than 10:
awk -F"|" '{count[$1]++}END{for (i in count) if (count[i] < 10){print count[i],i}}' file
3 96472
Upvotes: 6