james joyce

Reputation: 493

Count the occurrences of each word, total words, and total unique words in R

I have a huge df with a doc_id column and a word column. Each word can belong to several classes (Class_1, Class_2, Class_3): the class column holds 1 if the word is in that class and 0 if not.

SAMPLE DF

doc_id   word       Class_1   Class_2   Class_3 
104      saturn       1         0         1
104      survival     1         1         0
104      saturn       1         0         1
104      car          0         1         0  
118      baseball     1         1         0
118      color        0         0         1
118      muscle       0         1         0
187      image        1         0         0
187      pulled       0         0         0
187      game         1         0         1
187      play         0         0         1 
187      game         1         1         0 
125      translation  1         0         0  
125      survival     0         1         0
125      input        1         0         1      
125      excellent    1         0         0 
142      nice         0         1         0
142      article      0         1         0 
142      original     1         0         1
142      content      0         1         0
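
For anyone who wants to run the answers, here is a minimal sketch that rebuilds the sample above as a data frame. The name df follows the wording of the question (the first answer calls the same object data); the numeric column types are an assumption, since the question does not state them.

```r
# Rebuild the 20-row sample shown above (column types are assumed)
df <- data.frame(
  doc_id = rep(c(104, 118, 187, 125, 142), times = c(4, 3, 5, 4, 4)),
  word = c("saturn", "survival", "saturn", "car",
           "baseball", "color", "muscle",
           "image", "pulled", "game", "play", "game",
           "translation", "survival", "input", "excellent",
           "nice", "article", "original", "content"),
  Class_1 = c(1, 1, 1, 0,  1, 0, 0,  1, 0, 1, 0, 1,  1, 0, 1, 1,  0, 0, 1, 0),
  Class_2 = c(0, 1, 0, 1,  1, 0, 1,  0, 0, 0, 0, 1,  0, 1, 0, 0,  1, 1, 0, 1),
  Class_3 = c(1, 0, 1, 0,  0, 1, 0,  0, 0, 1, 1, 0,  0, 0, 1, 0,  0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Quick sanity checks: per-class totals and number of distinct words
colSums(df[, c("Class_1", "Class_2", "Class_3")])
length(unique(df$word))
```

To use it with code that expects an object called data, assign it first with data <- df.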

Using this sample df, I want to count the number of occurrences of each word in each class (Class_1, Class_2, Class_3),
the total number of words in each class (e.g. how many words there are in Class_1),
and lastly the total number of unique words across all documents.

OUTPUT DF should be something like this

doc_id   word       Occ_1  Occ_2  Occ_3  Totl_1  Totl_2  Totl_3  Unique_words 
104      saturn       2      0      2      11     9       7       17
104      survival     1      2      0      11     9       7       17
104      car          0      1      0      11     9       7       17
118      baseball     1      1      0      11     9       7       17
118      color        0      0      1      11     9       7       17
118      muscle       0      1      0      11     9       7       17
187      image        1      0      0      11     9       7       17
187      pulled       0      0      0      11     9       7       17  
187      game         2      1      1      11     9       7       17
187      play         0      0      1      11     9       7       17
125      translation  1      0      0      11     9       7       17 
125      input        1      0      1      11     9       7       17
125      excellent    1      0      0      11     9       7       17
142      nice         0      1      0      11     9       7       17
142      article      0      1      0      11     9       7       17 
142      original     1      0      1      11     9       7       17
142      content      0      1      0      11     9       7       17

Where:
Occ_1 = number of occurrences of the word in Class_1 (and likewise for Class_2 and Class_3)
Totl_1 = total number of words in Class_1 (and likewise for Class_2 and Class_3)
Unique_words = total number of unique words across all documents

Upvotes: 0

Views: 397

Answers (2)

Ric S

Reputation: 9247

Using dplyr, you can run the following lines:

library(dplyr)

data %>%
  group_by(word) %>%
  summarise(
    doc_id = first(doc_id),
    Occ_1 = sum(Class_1),
    Occ_2 = sum(Class_2),
    Occ_3 = sum(Class_3)
  ) %>%
  arrange(doc_id, word) %>%
  mutate(
    Totl_1 = sum(Occ_1),
    Totl_2 = sum(Occ_2),
    Totl_3 = sum(Occ_3),
    Unique_words = n()
  )

Output

   word        doc_id Occ_1 Occ_2 Occ_3 Totl_1 Totl_2 Totl_3 Unique_words
   <chr>       <chr>  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>        <int>
 1 car         104        0     1     0     11      9      7           17
 2 saturn      104        2     0     2     11      9      7           17
 3 survival    104        1     2     0     11      9      7           17
 4 baseball    118        1     1     0     11      9      7           17
 5 color       118        0     0     1     11      9      7           17
 6 muscle      118        0     1     0     11      9      7           17
 7 excellent   125        1     0     0     11      9      7           17
 8 input       125        1     0     1     11      9      7           17
 9 translation 125        1     0     0     11      9      7           17
10 article     142        0     1     0     11      9      7           17
11 content     142        0     1     0     11      9      7           17
12 nice        142        0     1     0     11      9      7           17
13 original    142        1     0     1     11      9      7           17
14 game        187        2     1     1     11      9      7           17
15 image       187        1     0     0     11      9      7           17
16 play        187        0     0     1     11      9      7           17
17 pulled      187        0     0     0     11      9      7           17

I added an arrange() call to sort the result by doc_id and word; otherwise the output would be sorted alphabetically by word.
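
If there are many class columns, typing one sum() per column gets tedious. The sketch below is one possible generalisation using dplyr's across(); it assumes dplyr 1.0 or later and that every class column starts with "Class_". The small data frame is a made-up toy example, not the sample from the question.

```r
library(dplyr)

# Toy data (hypothetical) with the same column layout as the question
data <- data.frame(
  doc_id  = c(104, 104, 104, 125),
  word    = c("saturn", "survival", "saturn", "survival"),
  Class_1 = c(1, 1, 1, 0),
  Class_2 = c(0, 1, 0, 1),
  Class_3 = c(1, 0, 1, 0)
)

result <- data %>%
  group_by(word) %>%
  summarise(
    doc_id = first(doc_id),
    # One per-word sum for every Class_* column, named Occ_Class_1, ...
    across(starts_with("Class_"), sum, .names = "Occ_{.col}"),
    .groups = "drop"
  ) %>%
  mutate(
    # Class totals are the column sums of the per-word counts
    across(starts_with("Occ_"), sum, .names = "Totl_{.col}"),
    # One row per word at this point, so n() is the number of unique words
    Unique_words = n()
  ) %>%
  # Shorten Occ_Class_1 to Occ_1 and Totl_Occ_Class_1 to Totl_1
  rename_with(~ sub("^Totl_Occ_", "Totl_", sub("Class_", "", .x)))

result
```

The logic is the same as in the code above; only the column selection is automated, so adding a Class_4 column would need no further changes.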

Upvotes: 2

hello_friend

Reputation: 5788

Install (if missing) and load these packages:

necessary_packages <-
  c("dplyr", "tidyr")
new_packages <-
  necessary_packages[!(necessary_packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) {
  install.packages(new_packages, dependencies = TRUE)
}
lapply(necessary_packages, require, character.only = TRUE)

Now let's count the words and reshape your df:

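df <-
  df %>%
  # Long format: one row per (doc_id, word, class); columns 3:5 are Class_1..Class_3
  gather("class", "n", 3:5) %>%
  mutate(class = sub("Class_", "", class)) %>%
  group_by(class) %>%
  mutate(Totl = sum(n)) %>%                      # total words per class
  group_by(word, class) %>%
  summarise(doc_id = first(doc_id), Occ = sum(n), Totl = first(Totl)) %>%
  ungroup() %>%
  mutate(Unique_words = n_distinct(word)) %>%    # distinct words overall
  # Back to wide format: Occ_1..Occ_3 and Totl_1..Totl_3, one row per word
  gather(variable, value, Occ, Totl) %>%
  unite("key", variable, class) %>%
  spread(key, value)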

Note: I haven't run the above code, because you didn't provide code to build your df.

Upvotes: 1
