Reputation: 126
I'm doing some data gathering on massive log files and I need to count the occurrences of unique strings. Generally the way this is done is with a command like:
zcat <file> | grep -o <filter> | sort | uniq -c | sort -n
What I'm looking to do is not pay the performance penalty of the sort after the grep. Is this possible to do without leaving bash?
Upvotes: 3
Views: 1091
Reputation: 785008
You can use awk to count the uniques and avoid sort:
zgrep -o <filter> <file> |
awk '{count[$0]++} END{for (i in count) print count[i], i}'
Also note you can avoid zcat and call zgrep directly.
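Note that awk's for (i in count) loop yields the keys in no particular order. If you also want the numeric ordering your original sort -n gave you, pipe the aggregated output through sort; this stays cheap because it only sorts the unique values, not every matched line:
zgrep -o <filter> <file> |
awk '{count[$0]++} END{for (i in count) print count[i], i}' |
sort -n   # sorts only the already-counted unique lines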
Upvotes: 5
Reputation: 116700
jq has built-in associative arrays, so you could consider one of the following approaches, which are both efficient (like awk):
zgrep -o <filter> <file> |
jq -nR 'reduce inputs as $line ({}; .[$line] += 1)'
This would produce the results as a JSON object with the frequencies as the object's values, e.g.
{
  "a": 2,
  "b": 1,
  "c": 1
}
If you want each line of output to consist of a count and value (in that order), then an appropriate jq invocation would be:
jq -nRr 'reduce inputs as $line ({}; .[$line] += 1)
| to_entries[] | "\(.value) \(.key)"'
This would produce output like so:
2 a
1 b
1 c
The jq options used here are:
-n # for use with `inputs`
-R # "raw" input
-r # "raw" output
Upvotes: 1
Reputation: 5939
Since you mentioned you don't want to leave bash: you could try using associative arrays, with the input lines as keys and the counts as values. To learn about associative arrays, see http://www.gnu.org/software/bash/manual/html_node/Arrays.html.
But be sure to benchmark the performance: you may nevertheless be better off using sort and uniq, or perl, or ...
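For reference, a minimal sketch of that approach (it reuses the zgrep -o pipeline from the other answers; treat it as illustrative and benchmark it before relying on it):
declare -A count                                     # associative array: string -> occurrences
while IFS= read -r line; do
    count["$line"]=$(( ${count["$line"]:-0} + 1 ))   # increment the counter for this line
done < <(zgrep -o <filter> <file>)

for line in "${!count[@]}"; do                       # iterate over the unique strings
    printf '%s %s\n' "${count[$line]}" "$line"
done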
Upvotes: 1