Fabian

Reputation: 13

Count number of occurrences of words from a list in a data frame in R

I am working on a little program in R that counts, for each word in a list, how many times it occurs in a data frame.

I therefore import my data frame and my word list as follows.

df <- read.csv("tweets.csv")
wordlist <- read.csv("wordlist.csv")

My idea was to use a "for"-loop which runs through all the words in the wordlist, counts their occurrences in the df data frame, and then adds the number to the existing wordlist.

for (id in wordlist) 
{
  wordlist$frequency <- sum(stri_detect_fixed(df$text, wordlist$word))
}

Clearly this doesn't work. Instead, it assigns the combined frequency of ALL words in my wordlist to every single word in the wordlist data frame, which looks something like:

id  word     frequency
1   the      1290
2   answer   1290
3   is       1290
4   wrong    1290

I know it has to do something with the running variable in my for-loop. Any help is appreciated :)

Upvotes: 1

Views: 3531

Answers (2)

Matt W.

Reputation: 3722

I would clean the tweets df to turn things to lowercase, remove stopwords, punctuation, etc. Clean the tweets first, otherwise you're going to get "Dog" and "dog" as two different words.

x <- c("Heute oft gelesen: Hörmann: «Lieber die Fair-Play-Medaille» als Platz eins t.co/w75t1O3zWQ t.co/fQJ2eUbGLf",
       "Lokalsport: Wallbaum versteigert Olympia-Kalender t.co/uH5HnJTwUE",
       "Die „politischen Spiele“ sind hiermit eröffnet: t.co/EWSnRmNHlw via @de_sputnik")
wordlist <- c("Olympia", "hiermit", "Die")

I would then use sapply to lowercase each tweet and split it on spaces. Then I'd flatten it using unlist so it's a single vector instead of a list, and unname it so it's a bit easier to read.

library(stringr)  # for str_split

wordvec <- unname(unlist(sapply(x, function(z) str_split(tolower(z), " "))))

 [1] "heute"               "oft"                 "gelesen:"            "hörmann:"            "«lieber"            
 [6] "die"                 "fair-play-medaille»" "als"                 "platz"               "eins"               
[11] "t.co/w75t1o3zwq"     "t.co/fqj2eubglf"     "lokalsport:"         "wallbaum"            "versteigert"        
[16] "olympia-kalender"    "t.co/uh5hnjtwue"     "die"                 "„politischen"        "spiele“"            
[21] "sind"                "hiermit"             "eröffnet:"           "t.co/ewsnrmnhlw"     "via"                
[26] "@de_sputnik"  

I think this is still pretty messy. I would look up some text-cleaning solutions, like removing special characters, or using grepl or something to strip out the http stuff.

To filter the list to only contain your words, try:

wordvec[wordvec %in% tolower(wordlist)]
[1] "die"     "die"     "hiermit"

And then you can use table

table(wordvec[wordvec %in% tolower(wordlist)])

die hiermit 
  2       1 

You can do that last part in reverse if you'd like, but I'd focus on cleaning the texts up first to remove the special characters and URLs.
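As a starting point for that cleanup, here is a minimal sketch using base R's gsub; the regex patterns are my own assumptions about what counts as noise, not a complete cleaner:

```r
x <- c("Heute oft gelesen: Hörmann: «Lieber die Fair-Play-Medaille» als Platz eins t.co/w75t1O3zWQ",
       "Die „politischen Spiele“ sind hiermit eröffnet: t.co/EWSnRmNHlw via @de_sputnik")

clean <- tolower(x)
clean <- gsub("\\S*t\\.co/\\S*", "", clean)       # drop shortened t.co URLs
clean <- gsub("@\\S+", "", clean)                 # drop @mentions
clean <- gsub("[[:punct:]]|[«»„“]", " ", clean)   # drop punctuation and quote marks
clean <- gsub("\\s+", " ", trimws(clean))         # collapse runs of whitespace
clean
```

After this, splitting on spaces gives a much tidier word vector to feed into the %in%/table steps above.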

Upvotes: 1

InfiniteFlash

Reputation: 1058

Here's how I would do it using sapply. The function checks, for each three-letter combination, how many of the strings in data contain it, and tallies up the count.

library(tidyverse)
library(stringi)

1,000 randomly generated strings of 100 letters each:

data <- replicate(100, sample(letters, size = 1000, replace = TRUE)) %>%
        data.frame() %>%
        unite("string", colnames(.), sep = "", remove = TRUE) %>%
        .$string

head(data)
[1] "uggwaevptdbhhnmvunkcgdssgmulvyxxhnavbxxotwvkvvlycjeftmjufymwzofrhepraqfjlfslynkvbyommaawrvaoscuytfws"
[2] "vftwusfmkzwougqqupeeelcyaefkcxmrqphajcnerfiitttizmpjucohkvsawwiqolkvuofnuarmkriojlnnuvkcreekirfdpsil"
[3] "kbtkrlogalroiulppghcosrpqnryldiuigtsfopsldmcrmnwcwxlhtukvzsujkhqnillzmgwytpneigogvnsxtjgzuuhbjpdvtab"
[4] "cuzqynmbidfwubiuozuhudfqpynnfmkplnyetxxfzvobafmkggiqncykhvmmdrexvxdvtkljppudeiykxsapvpxbbheikydcagey"
[5] "qktsojaevqdegrqunbganigcnvkuxbydepgevcwqqkyekezjddbzqvepodyugwloauxygzgxnwlrjzkyvuihqdfxptwrpsvsdpzf"
[6] "ssfsgxhkankqbrzborfnnvcvqjaykicocxuydzelnuyfljjrhytzgndrktzfglhsuimwjqvvvtvqjsdlnwcbhfdfbsbgdmvfyjef"  

Reference set of all three-letter combinations to check the data against:

three_consec_letters = expand.grid(letters, letters, letters) %>%
                       unite("consec", colnames(.), sep = "", remove = TRUE) %>%
                       .$consec

head(three_consec_letters)
[1] "aaa" "baa" "caa" "daa" "eaa" "faa"

Check each combination against the strings and sum up the matches:

counts = sapply(three_consec_letters, function(x) stri_detect_fixed(data, x) %>% sum())

Results

head(counts)
aaa baa caa daa eaa faa 
  5   6   6   4   0   3 

Hope this helps.
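Applied back to the original question, the same sapply pattern replaces the for loop. df$text and wordlist$word are the column names from the question; the data frames here are made up for illustration. Note that stri_detect_fixed flags whether each tweet contains the word, so this counts tweets containing the word rather than total occurrences:

```r
library(stringi)

df <- data.frame(text = c("the answer is wrong", "the question is fine"),
                 stringsAsFactors = FALSE)
wordlist <- data.frame(id   = 1:4,
                       word = c("the", "answer", "is", "wrong"),
                       stringsAsFactors = FALSE)

# one count per word, instead of one total for the whole list
wordlist$frequency <- sapply(wordlist$word,
                             function(w) sum(stri_detect_fixed(df$text, w)))
wordlist
```

The key difference from the question's loop is that the count is computed per word (the function argument w) instead of passing the whole wordlist$word column at once.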

Upvotes: 0
