Reputation: 177
I want to count the number of elements with the value Z.
I will give an example of what I need.
I have a huge number of tags (millions) stored in a data frame, in the form <X>, as shown in the figure below. I need the count of each tag to find the top 10 tags (the tags that were mentioned the most).
The example :
The output I need is :
Note: I'm a beginner in R, so I need the simplest way.
My attempt: I tried the function table(), but it looks like it only works for numbers. I tried group_by() and did not get the result that I want. Sample of the data set:
DF <- data.frame(Tag=c("<C++><Java>","<java><python><javascript>","<java><C++>","<Json><PHP>","<PHP><Java>"))
Tag
1 <C++><Java>
2 <java><python><javascript>
3 <java><C++>
4 <Json><PHP>
5 <PHP><Java>
Upvotes: 0
Views: 190
Reputation: 50678
Here is a solution in base R; in my opinion, there's no need for a load of additional libraries in this case.
table(unlist(strsplit(gsub("(^<|>$)", "", DF$Tag), "><")));
#        C++       java       Java javascript       Json        PHP     python
#          2          2          2          1          1          2          1
Or if you want to ignore capitalisation you could convert all tags to lower-case:
table(tolower(unlist(strsplit(gsub("(^<|>$)", "", DF$Tag), "><"))));
#        c++       java javascript       json        php     python
#          2          4          1          1          2          1
As a data.frame:
as.data.frame(table(tolower(unlist(strsplit(gsub("(^<|>$)", "", DF$Tag), "><")))))
# Var1 Freq
#1 c++ 2
#2 java 4
#3 javascript 1
#4 json 1
#5 php 2
#6 python 1
Explanation: Remove "<" and ">" from the beginning and end of each string, respectively; strsplit on "><"; and use table to count occurrences.
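Since the question asks for the top 10 tags, the steps in the explanation can be unrolled into a short sketch that also sorts the counts and keeps the first n. This is only an illustration under a few assumptions: the helper name top_n_tags and the intermediate variable names are made up here, tags are lower-cased before counting, and stringsAsFactors = FALSE is set so that Tag is a character column on older R versions too.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps Tag as character
DF <- data.frame(
  Tag = c("<C++><Java>", "<java><python><javascript>", "<java><C++>",
          "<Json><PHP>", "<PHP><Java>"),
  stringsAsFactors = FALSE
)

# Step 1: strip the leading "<" and the trailing ">" from each string
stripped <- gsub("(^<|>$)", "", DF$Tag)

# Step 2: split on the "><" boundary between tags and flatten to one vector
tags <- tolower(unlist(strsplit(stripped, "><", fixed = TRUE)))

# Step 3: count, sort in descending order, and keep at most the first n
# (top_n_tags is a hypothetical helper, not part of base R)
top_n_tags <- function(tags, n = 10) {
  head(sort(table(tags), decreasing = TRUE), n)
}

top_tags <- top_n_tags(tags, n = 10)
print(top_tags)
```

With the five sample rows there are only six distinct tags, so head() simply returns all of them; on the full data it would cut the table down to the ten most frequent tags.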
Upvotes: 1
Reputation: 79238
library(tidyverse)
library(tidytext)
DF %>%
  mutate(Freq = 1) %>%
  unnest_tokens(Tag, Tag, "regex", pattern = "<|>|\\n") %>%
  group_by(Tag) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
# A tibble: 6 x 2
Tag count
<chr> <int>
1 java 4
2 c++ 2
3 php 2
4 javascript 1
5 json 1
6 python 1
To do this in base R, you will need to split the strings, then trim the whitespace, tabulate with table, and sort in descending order:
sort(table(trimws(unlist(strsplit(gsub("<(.*?)>(R?)", "\\U\\1 ", DF$Tag, perl = TRUE), " ")))), decreasing = TRUE)
      JAVA        C++        PHP JAVASCRIPT       JSON     PYTHON
         4          2          2          1          1          1
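To see what the gsub() step hands to strsplit(), here is a small sketch on a single string. It assumes perl = TRUE, which the \\U case-conversion escape in the replacement needs; the variable names are illustrative, and the pattern is simplified to "<(.*?)>".

```r
x <- "<java><C++>"

# With perl = TRUE, "\\U" upper-cases the captured tag, so each "<tag>" becomes "TAG "
upper <- gsub("<(.*?)>", "\\U\\1 ", x, perl = TRUE)

# Splitting on the space separates the tags; strsplit() drops the trailing empty piece
pieces <- unlist(strsplit(upper, " ", fixed = TRUE))
print(pieces)
```

Because the split already leaves no surrounding spaces here, trimws() acts as a safeguard rather than a required step for this sample.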
Upvotes: 1
Reputation: 23574
Using the stringi, dplyr, and tidytext packages, you could do the following. You can extract the computer language names with stri_extract_all_regex(), and split each string and create a data frame with unnest_tokens(). Then, you count how many times each language appeared in the data set.
library(dplyr)
library(stringi)
library(tidytext)
DF %>%
  unnest_tokens(input = Tag, output = language, token = stri_extract_all_regex,
                pattern = "(?<=\\<)[^<>]*(?=\\>)", to_lower = TRUE) %>%
  count(language, sort = TRUE)
language n
<chr> <int>
1 java 4
2 c++ 2
3 php 2
4 javascript 1
5 json 1
6 python 1
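For comparison, the same extract-then-count idea can be sketched in base R with regmatches() and gregexpr(), using an equivalent lookaround pattern. This is a swapped-in alternative rather than the tidytext pipeline above; the variable names are illustrative, and stringsAsFactors = FALSE is assumed so that Tag is a character column.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps Tag as character
DF <- data.frame(
  Tag = c("<C++><Java>", "<java><python><javascript>", "<java><C++>",
          "<Json><PHP>", "<PHP><Java>"),
  stringsAsFactors = FALSE
)

# Extract every run of characters between "<" and ">";
# the lookbehind and lookahead require perl = TRUE
matches <- regmatches(DF$Tag, gregexpr("(?<=<)[^<>]*(?=>)", DF$Tag, perl = TRUE))

# Flatten the per-row matches, lower-case them, then count and sort
language <- tolower(unlist(matches))
counts <- sort(table(language), decreasing = TRUE)
print(counts)
```

Wrapping the result in head(counts, 10) would keep only the ten most frequent tags.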
Upvotes: 2