Reputation: 667

Word count and its output

I have the following lines:

123;123;#rss
123;123;#site #design #rss
123;123;#rss
123;123;#rss
123;123;#site #design

and need to count how many times each tag appears. I do the following:

grep -Eo '#[a-z].*' ./1.txt | tr "\ " "\n" | uniq -c

i.e. I first extract only the tags from each line, then split them onto separate lines and count them.

output:

   1 #rss
   1 #site
   1 #design
   3 #rss
   1 #site
   1 #design

instead of the expected:

   2 #site
   4 #rss
   2 #design

It seems that the problem is caused by non-printable characters, which make the count incorrect. Or is it something else? Can anyone suggest a correct solution?

Upvotes: 2

Answers (5)

Ed Morton

Reputation: 203493

$ cut -d';' -f3 file | tr ' ' '\n' | sort | uniq -c
      2 #design
      4 #rss
      2 #site

Upvotes: 0

RavinderSingh13

Reputation: 133508

With the samples shown, you could try the following. Written and tested in GNU awk.

awk '
{
  while(match($0,/#[^ ]*/)){
    count[substr($0,RSTART,RLENGTH)]++
    $0=substr($0,RSTART+RLENGTH)
  }
}
END{
  for(key in count){
    print count[key],key
  }
}' Input_file

Output will be as follows.

2 #site
2 #design
4 #rss

Explanation: a detailed explanation of the above.

awk '                                     ##Start the awk program.
{
  while(match($0,/#[^ ]*/)){              ##Loop while the current line still contains a match for the regex #[^ ]*.
    count[substr($0,RSTART,RLENGTH)]++    ##Increment the count array entry indexed by the matched substring.
    $0=substr($0,RSTART+RLENGTH)          ##Set the current line to the rest of the line after the match.
  }
}
END{                                      ##Start the END block of the program.
  for(key in count){                      ##Loop through the count array.
    print count[key],key                  ##Print the count followed by its key (the tag).
  }
}
' Input_file                              ##Name of the input file.

Upvotes: 0

anubhava

Reputation: 785128

This can be done in a single GNU awk command (RT is a gawk extension):

awk -v RS='#[a-zA-Z]+' 'RT {++freq[RT]} END {for (i in freq) print freq[i], i}' file

2 #site
2 #design
4 #rss

Or else a grep + awk solution:

grep -iEo '#[a-z]+' file |
awk '{++freq[$1]} END {for (i in freq) print freq[i], i}'

2 #site
2 #design
4 #rss

Upvotes: 1

Raman Sailopal

Reputation: 12877

Using awk as an alternative:

awk -F '[ ;]' '{ for (i=3; i<=NF; i++) { map[$i]++ } } END { for (i in map) { print map[i]" "i } }' file

Set the field separator to a space or a ";". Then loop from the third field to the last field (NF), incrementing an entry in the array map, with the field as the index and the running count as the value. Once the whole file has been processed, loop through the map array and print each count with its tag.
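
As a rough sketch of the steps just described, the snippet below runs the same counting on the question's sample, written to a temporary file so it is self-contained. The variable names `input` and `counts` and the trailing `sort` are my additions, not part of the original command. awk does not guarantee the order of `for (i in map)`, so sorting makes the output stable.

```shell
# Sample data from the question, written to a temporary file (name chosen by mktemp).
input=$(mktemp)
cat > "$input" <<'EOF'
123;123;#rss
123;123;#site #design #rss
123;123;#rss
123;123;#rss
123;123;#site #design
EOF

# Split on space or ';', count fields 3..NF, then sort by count (descending) and tag.
counts=$(awk -F '[ ;]' '{ for (i = 3; i <= NF; i++) map[$i]++ }
                        END { for (t in map) print map[t], t }' "$input" |
         LC_ALL=C sort -k1,1nr -k2)
rm -f "$input"

printf '%s\n' "$counts"
# 4 #rss
# 2 #design
# 2 #site
```

The `sort` step is optional if the order of the tags does not matter.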

Upvotes: 0

Socowi

Reputation: 27215

uniq -c only counts adjacent duplicate lines, so it works only on sorted input.
Also, you can drop the tr by changing the regex to #[a-z]*.

grep -Eo '#[a-z]*' ./1.txt | sort | uniq -c

prints

  2 #design
  4 #rss
  2 #site

as expected.
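
To see the effect of the missing sort in isolation, here is a small sketch that feeds the tags to uniq -c in the order the question's pipeline emits them, once unsorted and once sorted. The variable names `tags`, `unsorted` and `sorted` are only for illustration.

```shell
# The tags in the order they appear in the question's input, one per line.
tags=$(printf '%s\n' '#rss' '#site' '#design' '#rss' '#rss' '#rss' '#site' '#design')

# Without sort: uniq -c merges only adjacent duplicates, so each tag shows up in several groups.
unsorted=$(printf '%s\n' "$tags" | uniq -c)

# With sort: identical tags become adjacent and are counted together.
sorted=$(printf '%s\n' "$tags" | LC_ALL=C sort | uniq -c)

printf 'unsorted:\n%s\n\nsorted:\n%s\n' "$unsorted" "$sorted"
```

The unsorted run reproduces the six groups from the question, while the sorted run gives three lines, one per tag. No non-printable characters are involved.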

Upvotes: 2
