user8863554

Reputation: 177

Counting the number of elements with the value Z in a data frame in R

I want to count the number of elements with the value Z.

I will give an example of what I need.

I have a huge number of tags (millions) stored in a data frame, in the form <X>, as shown in the figure below. I need the count of each tag to find the top 10 tags (the tags that were mentioned the most).

The example:

[image]

The output I need is:

[image]

Note: I'm a beginner in R, so I need the simplest way.

My attempt: I tried the function table(), but it looks like it only works for numbers. I also tried group_by(), but I did not get the result that I want.

Sample of the data set:

DF <- data.frame(Tag=c("<C++><Java>","<java><python><javascript>","<java><C++>","<Json><PHP>","<PHP><Java>"))

                         Tag
1                <C++><Java>
2 <java><python><javascript>
3                <java><C++>
4                <Json><PHP>
5                <PHP><Java>

Upvotes: 0

Views: 190

Answers (3)

Maurits Evers

Reputation: 50678

Here is a solution in base R; in my opinion, there's no need for a load of additional libraries in this case.

table(unlist(strsplit(gsub("(^<|>$)", "", DF$Tag), "><")))
#       C++       java       Java javascript       Json        PHP     python
#         2          2          2          1          1          2          1

Or if you want to ignore capitalisation you could convert all tags to lower-case:

table(tolower(unlist(strsplit(gsub("(^<|>$)", "", DF$Tag), "><"))))
#       c++       java javascript       json        php     python
#         2          4          1          1          2          1

As a data.frame:

as.data.frame(table(tolower(unlist(strsplit(gsub("(^<|>$)", "", DF$Tag), "><")))))
#        Var1 Freq
#1        c++    2
#2       java    4
#3 javascript    1
#4       json    1
#5        php    2
#6     python    1

Explanation: Remove "<" from the beginning and ">" from the end of each string; split the result on "><" with strsplit; and use table to count occurrences.
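The same steps can be sketched one at a time, which may be easier to follow for a beginner. This is only an illustration on the sample data from the question; the intermediate names (stripped, tags, tag_counts, top_tags) are made up here, and the final head() call is an assumed way of getting the top 10 that the question asks for.

```r
# Sample data from the question
DF <- data.frame(Tag = c("<C++><Java>", "<java><python><javascript>",
                         "<java><C++>", "<Json><PHP>", "<PHP><Java>"),
                 stringsAsFactors = FALSE)

# Step 1: strip the leading "<" and the trailing ">" of each string,
# so "<C++><Java>" becomes "C++><Java"
stripped <- gsub("(^<|>$)", "", DF$Tag)

# Step 2: split on the remaining "><" separators and flatten the list
tags <- unlist(strsplit(stripped, "><"))

# Step 3: count the tags case-insensitively, most frequent first
tag_counts <- sort(table(tolower(tags)), decreasing = TRUE)

# Step 4: keep at most the 10 most frequent tags
top_tags <- head(tag_counts, 10)
print(top_tags)
```

On the real data, only the DF definition would change; the other steps stay the same.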

Upvotes: 1

Onyambu

Reputation: 79238

library(tidyverse)
library(tidytext)

DF %>%
  mutate(Freq = 1) %>%
  unnest_tokens(Tag, Tag, token = "regex", pattern = "<|>|\\n") %>%
  group_by(Tag) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
# A tibble: 6 x 2
         Tag count
       <chr> <int>
1       java     4
2        c++     2
3        php     2
4 javascript     1
5       json     1
6     python     1

To do this in base R, replace each <tag> with the upper-cased tag followed by a space, split on the spaces, trim any remaining whitespace, count with table, and sort in descending order:

sort(table(trimws(unlist(strsplit(gsub("<(.*?)>(R?)", "\\U\\1 ", DF$Tag, perl = TRUE), " ")))), decreasing = TRUE)

      JAVA        C++        PHP JAVASCRIPT       JSON     PYTHON 
         4          2          2          1          1          1 
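To see what the gsub() step contributes, here is a small sketch on a single string from the sample data. It leaves out the optional (R?) group from the pattern for brevity, and the names x, spaced and tags are only illustrative.

```r
# One string from the question's sample data
x <- "<java><python><javascript>"

# perl = TRUE enables "\\U", which upper-cases what follows it in the
# replacement; each "<tag>" becomes the tag in capitals plus a space
spaced <- gsub("<(.*?)>", "\\U\\1 ", x, perl = TRUE)
print(spaced)

# Splitting on the space yields one element per tag
tags <- strsplit(spaced, " ")[[1]]
print(tags)
```

Upper-casing plays the same role as tolower() in the other answers: it makes "Java" and "java" count as one tag.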

Upvotes: 1

jazzurro

Reputation: 23574

Using the stringi, dplyr, and tidytext packages, you could do the following. stri_extract_all_regex() extracts the language names from each string, and unnest_tokens() uses it as the tokenizer to build a data frame with one language per row. Then you count how many times each language appears in the data set.

library(dplyr)
library(tidytext)
library(stringi)

DF %>%
  unnest_tokens(input = Tag, output = language, token = stri_extract_all_regex,
                pattern = "(?<=\\<)[^<>]*(?=\\>)", to_lower = TRUE) %>%
  count(language, sort = TRUE)

  language       n
  <chr>      <int>
1 java           4
2 c++            2
3 php            2
4 javascript     1
5 json           1
6 python         1
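For readers who want to try the lookaround pattern without installing any packages, a rough base R equivalent is sketched below. It swaps stri_extract_all_regex() for regmatches() with gregexpr(), so it is not the method used in this answer, and the names pattern, matches, languages and lang_counts are only illustrative.

```r
# Sample data from the question
DF <- data.frame(Tag = c("<C++><Java>", "<java><python><javascript>",
                         "<java><C++>", "<Json><PHP>", "<PHP><Java>"),
                 stringsAsFactors = FALSE)

# Lookaround pattern: any run of characters other than "<" or ">"
# that is preceded by "<" and followed by ">"
pattern <- "(?<=<)[^<>]*(?=>)"

# gregexpr() locates every match in each string;
# regmatches() extracts the matched text as a list of character vectors
matches <- regmatches(DF$Tag, gregexpr(pattern, DF$Tag, perl = TRUE))

# Flatten, lower-case, count, and sort by frequency
languages <- tolower(unlist(matches))
lang_counts <- sort(table(languages), decreasing = TRUE)
print(lang_counts)
```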

Upvotes: 2
