Reputation: 667
I have the following lines:
123;123;#rss
123;123;#site #design #rss
123;123;#rss
123;123;#rss
123;123;#site #design
and need to count how many times each tag appears. I do the following:
grep -Eo '#[a-z].*' ./1.txt | tr "\ " "\n" | uniq -c
That is, I first extract only the tags from each line, then split them onto separate lines and count them.
output:
1 #rss
1 #site
1 #design
3 #rss
1 #site
1 #design
instead of the expected:
2 #site
4 #rss
2 #design
It seems that the problem is non-printable characters, which make the count incorrect. Or is it something else? Can anyone suggest a correct solution?
Upvotes: 2
Views: 160
Reputation: 203493
$ cut -d';' -f3 file | tr ' ' '\n' | sort | uniq -c
2 #design
4 #rss
2 #site
Upvotes: 0
Reputation: 133508
With your shown samples only, could you please try the following. Written and tested in GNU awk.
awk '
{
  while($0){
    match($0,/#[^ ]*/)
    count[substr($0,RSTART,RLENGTH)]++
    $0=substr($0,RSTART+RLENGTH)
  }
}
END{
  for(key in count){
    print count[key],key
  }
}' Input_file
Output will be as follows.
2 #site
2 #design
4 #rss
Explanation: a detailed, commented version of the above.
awk '                                      ##Start the awk program.
{
  while($0){                               ##Loop while something remains of the current line.
    match($0,/#[^ ]*/)                     ##Use match() to find the regex #[^ ]* in the current line.
    count[substr($0,RSTART,RLENGTH)]++     ##Use the matched substring as the index of the count array and increment its value by 1.
    $0=substr($0,RSTART+RLENGTH)           ##Replace the current line with the rest of the line after the match.
  }
}
END{                                       ##Start the END block of this program.
  for(key in count){                       ##Loop through the count array.
    print count[key],key                   ##Print the count stored under key, followed by key itself.
  }
}
' Input_file                               ##Name of the input file.
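The loop above assumes that each line ends exactly at its last tag, which holds for the shown samples. Below is a minimal sketch of a variant that uses the return value of match() as the loop condition, so a line with trailing text after the last tag, or with no tag at all, simply ends the loop. The helper name count_tags, the temporary file, and the final sort (added only to get a stable order) are assumptions for illustration and are not part of the answer.

```shell
# Sample data from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
123;123;#rss
123;123;#site #design #rss
123;123;#rss
123;123;#rss
123;123;#site #design
EOF

# count_tags is a hypothetical helper name used only in this sketch.
count_tags() {
    awk '
    {
        line = $0
        # match() returns 0 when no tag is left, which ends the loop.
        while (match(line, /#[^ ]*/)) {
            count[substr(line, RSTART, RLENGTH)]++
            line = substr(line, RSTART + RLENGTH)
        }
    }
    END {
        for (key in count) print count[key], key
    }' "$1" | sort -k2
}

count_tags "$input"
```

With the sample data this should print 2 #design, 4 #rss and 2 #site, one per line.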
Upvotes: 0
Reputation: 785128
It can be done in a single GNU awk command:
awk -v RS='#[a-zA-Z]+' 'RT {++freq[RT]} END {for (i in freq) print freq[i], i}' file
2 #site
2 #design
4 #rss
Or else a grep + awk solution:
grep -iEo '#[a-z]+' file |
awk '{++freq[$1]} END {for (i in freq) print freq[i], i}'
2 #site
2 #design
4 #rss
Upvotes: 1
Reputation: 12877
Using awk as an alternative:
awk -F '[ ;]' '{ for(i=3;i<=NF;i++) { map[$i]++ } } END { for (i in map) { print map[i]" "i} }' file
Set the field separator to a space or a ";". Then loop from the third field to the last field (NF), adding each field to an array, map, with the field as the index and an incrementing counter as the value. After the whole file has been processed, loop through the map array and print each count followed by its index.
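To see why the loop starts at the third field, here is a small sketch that prints how one sample line is split when the separator is a space or a ";". The variable name fields is an assumption for illustration only.

```shell
# Split one sample line on space or ";" and print each field with its number.
fields=$(printf '%s\n' '123;123;#site #design #rss' |
    awk -F '[ ;]' '{ for (i = 1; i <= NF; i++) print i ": " $i }')

printf '%s\n' "$fields"
```

Fields 1 and 2 hold the two numeric columns, so the tags occupy fields 3 through NF.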
Upvotes: 0
Reputation: 27215
uniq -c works only on sorted input.
Also, you can drop the tr by changing the regex to #[a-z]*.
grep -Eo '#[a-z]*' ./1.txt | sort | uniq -c
prints
2 #design
4 #rss
2 #site
as expected.
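The reason is that uniq only merges identical lines that are adjacent. A minimal sketch of the difference, using a made-up four-line input rather than the file from the question. The trailing awk only strips the padding that uniq -c puts in front of the count, and the variable names are assumptions.

```shell
# Unsorted: the "#rss" lines are split by "#site", so they are counted as two runs.
unsorted=$(printf '%s\n' '#rss' '#site' '#rss' '#rss' | uniq -c | awk '{ print $1, $2 }')

# Sorted first: identical lines become adjacent and are counted together.
sorted=$(printf '%s\n' '#rss' '#site' '#rss' '#rss' | sort | uniq -c | awk '{ print $1, $2 }')

printf 'without sort:\n%s\nwith sort:\n%s\n' "$unsorted" "$sorted"
```

Without sort, "#rss" shows up twice (counts 1 and 2). With sort, it shows up once with a count of 3.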
Upvotes: 2