ralar

Reputation: 126

Counting occurrences of unique strings in bash without first sorting the data

I'm doing some data gathering on massive log files and I need to count the occurrences of unique strings. Generally the way this is done is with a command like:

zcat <file> | grep -o <filter> | sort | uniq -c | sort -n

What I'm looking to do is avoid the performance penalty of the sort after the grep. Is this possible without leaving bash?

Upvotes: 3

Views: 1091

Answers (3)

anubhava

Reputation: 785008

You can use awk to count the uniques and avoid sort:

zgrep -o <filter> <file> |
awk '{count[$0]++} END{for (i in count) print count[i], i}'

Also note you can avoid zcat and call zgrep directly.
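
A minimal sketch of how this behaves, using a few hypothetical sample lines in place of the zgrep output:

printf 'a\nb\na\nc\n' |
awk '{count[$0]++} END{for (i in count) print count[i], i}'

which would print something like:

2 a
1 b
1 c

Note that awk's for (i in count) loop visits the keys in no particular order; if you still want the counts ordered, a final sort -n on this much smaller aggregated output is cheap.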

Upvotes: 5

peak

Reputation: 116700

jq has built-in associative arrays, so you could consider one of the following approaches, which are both efficient (like awk):

zgrep -o <filter> <file> |
  jq -nR 'reduce inputs as $line ({}; .[$line] += 1)'

This would produce the results as a JSON object with the frequencies as the object's values, e.g.

{
  "a": 2,
  "b": 1,
  "c": 1
}

If you want each line of output to consist of a count and value (in that order), then an appropriate jq invocation would be:

jq -nRr 'reduce inputs as $line ({}; .[$line] += 1)
         | to_entries[] | "\(.value) \(.key)"'

This would produce output like so:

2 a
1 b
1 c

The jq options used here are:

-n # for use with `inputs`
-R # "raw" input
-r # "raw" output

Upvotes: 1

Dirk Herrmann

Reputation: 5939

Since you mentioned you don't want to leave bash: you could try an associative array, using the input lines as keys and the counts as values. To learn about associative arrays, see http://www.gnu.org/software/bash/manual/html_node/Arrays.html.
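
A minimal sketch of that approach (it requires bash 4+ for associative arrays), assuming the filtered lines come from the same zgrep pipeline as in the other answers, with <filter> and <file> left as placeholders:

declare -A count

# Count each distinct line; quoting the key avoids problems with
# whitespace or glob characters in the data.
while IFS= read -r line; do
    count["$line"]=$(( ${count["$line"]:-0} + 1 ))
done < <(zgrep -o <filter> <file>)

# The keys come out in no particular order.
for key in "${!count[@]}"; do
    printf '%s %s\n' "${count[$key]}" "$key"
done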

But be sure to benchmark the performance: you may nevertheless be better off using sort and uniq, or perl, or ...

Upvotes: 1
