Reputation: 182
I want to count the number of lines in a document, grouped by prefix word. The prefix is the run of alphanumeric characters before the first underscore. I don't care much about sorting, but it would be nice to list the prefixes in descending order of occurrences.
The file looks like this:
prefix1_data1
prefix1_data2_a
differentPrefix_data3
prefix1_data2_b
differentPrefix_data5
prefix2_data4
differentPrefix_data5
The output should be the following:
prefix1 3
differentPrefix 3
prefix2 1
I already did this in Python, but I am curious whether it can be done more efficiently from the command line or with a bash script. The uniq command has the -c and -w options, but the length of the prefix may vary.
Upvotes: 6
Views: 6586
Reputation: 640
I like RomanPerekhrest's answer. It's more concise. Here is a small change to make it even more concise by using cut in place of sed.
cut -d_ -f1 testfile | sort | uniq -c
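One difference from the sed version is worth noting: by default cut passes lines that contain no delimiter through unchanged, so they would be counted as their own "prefix". The sketch below assumes that such lines should be skipped (the -s option) and that the counts should come out highest first, as the question asks. The temporary file and the extra no-underscore-here line are made up for illustration.

```shell
# Sample input from the question plus one hypothetical line without "_".
testfile=$(mktemp)
printf '%s\n' prefix1_data1 prefix1_data2_a differentPrefix_data3 \
    prefix1_data2_b differentPrefix_data5 prefix2_data4 differentPrefix_data5 \
    no-underscore-here > "$testfile"

# -s drops lines that contain no "_"; the final sort -rn orders by count, highest first.
result=$(cut -s -d_ -f1 "$testfile" | sort | uniq -c | sort -rn)
printf '%s\n' "$result"

rm -f "$testfile"
```

The order of prefixes with equal counts depends on the sort implementation and locale.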
Upvotes: 2
Reputation: 92854
A solution combining the sed, sort and uniq commands:
sed -rn 's/^([^_]+)_.*/\1/p' testfile | sort | uniq -c
The output:
3 differentPrefix
3 prefix1
1 prefix2
^([^_]+)_ - matches a substring (the prefix, made of any characters except _) from the start of the line up to the first underscore _
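To see what the pattern does and does not match, here is a small sketch run on a few made-up lines (they are not from the question's file). It uses -E rather than -r, on the assumption that your sed accepts it; GNU and BSD sed both do.

```shell
# Hypothetical sample lines: two with a prefix, one with no "_", one starting with "_".
result=$(printf '%s\n' prefix1_data2_a nounderscore _leading a_b_c |
    sed -En 's/^([^_]+)_.*/\1/p')
printf '%s\n' "$result"
# prints:
# prefix1
# a
```

Lines without an underscore, or with an empty prefix, do not match, and because of -n and the p flag they are not printed and therefore not counted.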
Upvotes: 7
Reputation: 1442
This can be done as follows, where testfile is a file with the contents shown above. Note that the prefixes have to be known in advance.
# grep -c counts matching lines; the trailing _ keeps prefix1 from also matching e.g. prefix10_
printf '%-20s%d\n' prefix1 "$(grep -c '^prefix1_' testfile)"
printf '%-20s%d\n' differentPrefix "$(grep -c '^differentPrefix_' testfile)"
printf '%-20s%d\n' prefix2 "$(grep -c '^prefix2_' testfile)"
You can compare this against your Python code to see which one is more efficient.
Upvotes: 0
Reputation: 13249
You could use awk:
awk -F_ '{a[$1]++}END{for(i in a) print i,a[i]}' file
The field separator is set to _. The array a is indexed by the first field of each line and holds the number of times that field occurs. Once the whole file has been read, the contents of the array are printed.
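The order in which awk walks the array with for (i in a) is unspecified. If you want the descending order mentioned in the question, one option is to pipe the result through sort. The sketch below writes the question's sample data to a temporary file; breaking ties by name is my own choice, not something the question requires.

```shell
# Sample input from the question, written to a temporary file.
file=$(mktemp)
printf '%s\n' prefix1_data1 prefix1_data2_a differentPrefix_data3 \
    prefix1_data2_b differentPrefix_data5 prefix2_data4 differentPrefix_data5 > "$file"

# Count per prefix, then sort by count (field 2, numeric, descending),
# breaking ties by prefix name (field 1).
result=$(awk -F_ '{a[$1]++} END{for (i in a) print i, a[i]}' "$file" |
    sort -k2,2nr -k1,1)
printf '%s\n' "$result"
# prints:
# differentPrefix 3
# prefix1 3
# prefix2 1

rm -f "$file"
```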
Upvotes: 7