user72716

Reputation: 273

Count keywords and word stems in tweets

I have a large dataframe consisting of tweets, and keyword dictionaries loaded as character vectors of words associated with morality (kw_Moral) and emotion (kw_Emo). In the past I have used these keyword dictionaries to subset the dataframe to get only the tweets that have one or more of the keywords present.

For example, to create a subset with only those tweets that have emotional keywords, I loaded in my keyword dictionary...

kw_Emo <- c("abusi*", "accept", "accepta*", "accepted", 
        "accepting", "accepts", "ache*", "aching", "active*", "admir*", 
        "ador*", "advantag*", "adventur*", "advers*", "affection*", "afraid", 
        "aggravat*", "aggress*", "agoniz*", "agony", "agree", "agreeab*", 
        "agreed", "agreeing", "agreement*", "agrees", "alarm*", "alone", 
        "alright*", "amaz*", "amor*", "amus*", "anger*", "angr*", "anguish*", 
        "annoy*", "antagoni*", "anxi*", "aok", "apath*", "appall*", "appreciat*", 
        "apprehens*", "argh*", "argu*", "arrogan*", "asham*", "assault*", 
        "asshole*", "assur*", "attachment*", "attract*", "aversi*", "avoid*", 
        "award*", "awesome", "awful", "awkward*", "bashful*", "bastard*", 
        "battl*", "beaten", "beaut*", "beloved", "benefic*", "benevolen*", 
        "benign*", "best", "better", "bitch*", "bitter*", "blam*", "bless*", 
        "bold*", "bonus*", "bore*", "boring", "bother*", "brave*", "bright*", 
        "brillian*", "broke", "burden*", "calm*", "cared", "carefree", 
        "careful*", "careless*", "cares", "casual", "casually", "certain*", 
        "challeng*", "champ*", "charit*", "charm*", "cheer*", "cherish*", 
        "chuckl*", "clever*", "comed*", "comfort*", "commitment*", "complain*", 
        "compliment*", "concerned", "confidence", "confident", "confidently", 
        "confront*", "confus*", "considerate", "contempt*", "contented*", 
        "contentment", "contradic*", "convinc*", "cool", "courag*", "crap", 
        "crappy", "craz*", "create*", "creati*", "credit*", "cried", 
        "cries", "critical", "critici*", "crude*", "cry", "crying", "cunt*", 
        "cut", "cute*", "cutie*", "cynic", "danger*", "daring", "darlin*", 
        "daze*", "dear*", "decay*", "defeat*", "defect*", "definite", 
        "definitely", "degrad*", "delectabl*", "delicate*", "delicious*", 
        "deligh*", "depress*", "depriv*", "despair*", "desperat*", "despis*", 
        "destruct*", "determina*", "determined", "devastat*", "difficult*", 
        "digni*", "disadvantage*", "disagree*", "disappoint*", "disaster*", 
        "discomfort*", "discourag*", "dishearten*", "disillusion*", "dislike", 
        "disliked", "dislikes", "disliking", "dismay*", "dissatisf*", 
        "distract*", "distraught", "distress*", "distrust*", "disturb*", 
        "divin*", "domina*", "doom*", "dork*", "doubt*", "dread*", "dull*", 
        "dumb*", "dump*", "dwell*", "dynam*", "eager*", "ease*", "easie*", 
        "easily", "easiness", "easing", "easy*", "ecsta*", "efficien*", 
        "egotis*", "elegan*", "embarrass*", "emotion", "emotional", "empt*", 
        "encourag*", "energ*", "engag*", "enjoy*", "enrag*", "entertain*", 
        "enthus*", "envie*", "envious", "excel*", "excit*", "excruciat*", 
        "exhaust*", "fab", "fabulous*", "fail*", "fake", "fantastic*", 
        "fatal*", "fatigu*", "favor*", "favour*", "fear", "feared", "fearful*", 
        "fearing", "fearless*", "fears", "feroc*", "festiv*", "feud*", 
        "fiery", "fiesta*", "fine", "fired", "flatter*", "flawless*", 
        "flexib*", "flirt*", "flunk*", "foe*", "fond", "fondly", "fondness", 
        "fool*", "forgave", "forgiv*", "fought", "frantic*", "freak*", 
        "free", "freeb*", "freed*", "freeing", "freely", "freeness", 
        "freer", "frees*", "friend*", "fright*", "frustrat*", "fuck", 
        "fucked*", "fucker*", "fuckin*", "fucks", "fume*", "fuming", 
        "fun", "funn*", "furious*", "fury", "geek*", "genero*", "gentle", 
        "gentler", "gentlest", "gently", "giggl*", "giver*", "giving", 
        "glad", "gladly", "glamor*", "glamour*", "gloom*", "glori*", 
        "glory", "goddam*", "gorgeous*", "gossip*", "grace", "graced", 
        "graceful*", "graces", "graci*", "grand", "grande*", "gratef*", 
        "grati*", "grave*", "great", "grief", "griev*", "grim*", "grin", 
        "grinn*", "grins", "grouch*", "grr*", "guilt*", "ha", "haha*", 
        "handsom*", "happi*", "happy", "harass*", "hated", "hateful*", 
        "hater*", "hates", "hating", "hatred", "hazy", "heartbreak*", 
        "heartbroke*", "heartfelt", "heartless*", "heartwarm*", "heh*", 
        "hellish", "helper*", "helpful*", "helping", "helpless*", "helps", 
        "hesita*", "hilarious", "hoho*", "homesick*", "honour*", "hope", 
        "hoped", "hopeful", "hopefully", "hopefulness", "hopeless*", 
        "hopes", "hoping", "horr*", "hostil*", "hug", "hugg*", "hugs", 
        "humiliat*", "humor*", "humour*", "hurra*", "idiot", "ignor*", 
        "impatien*", "impersonal", "impolite*", "importan*", "impress*", 
        "improve*", "improving", "inadequa*", "incentive*", "indecis*", 
        "ineffect*", "inferior*", "inhib*", "innocen*", "insecur*", "insincer*", 
        "inspir*", "insult*", "intell*", "interest*", "interrup*", "intimidat*", 
        "invigor*", "irrational*", "irrita*", "isolat*", "jaded", "jealous*", 
        "jerk", "jerked", "jerks", "joke*", "joking", "joll*", "joy*", 
        "keen*", "kidding", "kind", "kindly", "kindn*", "kiss*", "laidback", 
        "lame*", "laugh*", "lazie*", "lazy", "liabilit*", "libert*", 
        "lied", "lies", "like", "likeab*", "liked", "likes", "liking", 
        "livel*", "LMAO", "LOL", "lone*", "longing*", "lose", "loser*", 
        "loses", "losing", "loss*", "lost", "lous*", "love", "loved", 
        "lovely", "lover*", "loves", "loving*", "low*", "luck", "lucked", 
        "lucki*", "luckless*", "lucks", "lucky", "ludicrous*", "lying", 
        "mad", "maddening", "madder", "maddest", "madly", "magnific*", 
        "maniac*", "masochis*", "melanchol*", "merit*", "merr*", "mess", 
        "messy", "miser*", "miss", "missed", "misses", "missing", "mistak*", 
        "mock", "mocked", "mocker*", "mocking", "mocks", "molest*", "mooch*", 
        "mood", "moodi*", "moods", "moody", "moron*", "mourn*", "nag*", 
        "nast*", "neat*", "needy", "neglect*", "nerd*", "nervous*", "neurotic*", 
        "nice*", "numb*", "nurtur*", "obnoxious*", "obsess*", "offence*", 
        "offens*", "ok", "okay", "okays", "oks", "openminded*", "openness", 
        "opportun*", "optimal*", "optimi*", "original", "outgoing", "outrag*", 
        "overwhelm*", "pained", "painf*", "paining", "painl*", "pains", 
        "palatabl*", "panic*", "paradise", "paranoi*", "partie*", "party*", 
        "passion*", "pathetic*", "peculiar*", "perfect*", "personal", 
        "perver*", "pessimis*", "petrif*", "pettie*", "petty*", "phobi*", 
        "piss*", "piti*", "pity*", "play", "played", "playful*", "playing", 
        "plays", "pleasant*", "please*", "pleasing", "pleasur*", "poison*", 
        "popular*", "positiv*", "prais*", "precious*", "pressur*", "prettie*", 
        "pretty", "prick*", "pride", "privileg*", "prize*", "problem*", 
        "profit*", "promis*", "protested", "protesting", "proud*", "puk*", 
        "radian*", "rage*", "raging", "rancid*", "rape*", "raping", "rapist*", 
        "readiness", "ready", "reassur*", "reek*", "regret*", "reject*", 
        "relax*", "relief", "reliev*", "reluctan*", "remorse*", "repress*", 
        "resent*", "resign*", "resolv*", "restless*", "revigor*", "reward*", 
        "rich*", "ridicul*", "rigid*", "risk*", "ROFL", "romanc*", "romantic*", 
        "rotten", "rude*", "sad", "sadde*", "sadly", "sadness", "sarcas*", 
        "satisf*", "savage*", "scare*", "scaring", "scary", "sceptic*", 
        "scream*", "screw*", "selfish*", "sentimental*", "serious", "seriously", 
        "seriousness", "severe*", "shake*", "shaki*", "shaky", "share", 
        "shared", "shares", "sharing", "shit*", "shock*", "shook", "shy*", 
        "sigh", "sighed", "sighing", "sighs", "silli*", "silly", "sincer*", 
        "skeptic*", "smart*", "smil*", "smother*", "smug*", "snob*", 
        "sob", "sobbed", "sobbing", "sobs", "sociab*", "solemn*", "sorrow*", 
        "sorry", "soulmate*", "special", "splend*", "stammer*", "stank", 
        "startl*", "stink*", "strain*", "strange", "strength*", "stress*", 
        "strong*", "struggl*", "stubborn*", "stunk", "stunned", "stuns", 
        "stupid*", "stutter*", "succeed*", "success*", "suck", "sucked", 
        "sucker*", "sucks", "sucky", "sunnier", "sunniest", "sunny", 
        "sunshin*", "super", "superior*", "support", "supported", "supporter*", 
        "supporting", "supportive*", "supports", "suprem*", "sure*", 
        "surpris*", "suspicio*", "sweet", "sweetheart*", "sweetie*", 
        "sweetly", "sweetness*", "sweets", "talent*", "tantrum*", "tears", 
        "teas*", "tehe", "temper", "tempers", "tender*", "tense*", "tensing", 
        "tension*", "terribl*", "terrific*", "terrified", "terrifies", 
        "terrify", "terrifying", "terror*", "thank", "thanked", "thankf*", 
        "thanks", "thief", "thieve*", "thoughtful*", "threat*", "thrill*", 
        "ticked", "timid*", "toleran*", "tortur*", "tough*", "traged*", 
        "tragic*", "tranquil*", "trauma*", "treasur*", "treat", "trembl*", 
        "trick*", "trite", "triumph*", "trivi*", "troubl*", "TRUE", "trueness", 
        "truer", "truest", "truly", "trust*", "truth*", "turmoil", "ugh", 
        "ugl*", "unattractive", "uncertain*", "uncomfortabl*", "uncontrol*", 
        "uneas*", "unfortunate*", "unfriendly", "ungrateful*", "unhapp*", 
        "unimportant", "unimpress*", "unkind", "unlov*", "unpleasant", 
        "unprotected", "unsavo*", "unsuccessful*", "unsure*", "unwelcom*", 
        "upset*", "uptight*", "useful*", "useless*", "vain", "valuabl*", 
        "valuing", "vanity", "vicious*", "vigor*", "vigour*", "villain*", 
        "violat*", "virtuo*", "vital*", "vulnerab*", "vulture*", "warfare*", 
        "warm*", "warred", "weak*", "wealth*", "weapon*", "weep*", "weird*", 
        "welcom*", "well*", "wept", "whine*", "whining", "willing", "wimp*", 
        "win", "winn*", "wins", "wisdom", "wise*", "witch", "woe*", "won", 
        "wonderf*", "worr*", "worse*", "worship*", "worst", "wow*", "yay", 
        "yays","yearn*","stench*") %>% paste0(collapse="|")and then filtered my dataframe with the keywords...

tweets_E <- tweets[with(tweets, grepl(paste0("\\b(?:", paste(kw_Emo, collapse = "|"), ")\\b"), text, perl = TRUE)), ]

How do I expand on this process to count exactly how many of the dictionary words appear in each tweet? In other words, I want to add a column to the dataframe, say EmoWordCount, that shows the number of emotional words that appear in each tweet.


Here's a reproducible sample of my data:

dput(droplevels(head(TestTweets, 20)))

structure(list(Time = c("24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:03", "24/06/2016 10:55:03"
), clean_text = c("mayagoodfellow as always making sense of it all for us ive never felt less welcome in this country brexit  httpstcoiai5xa9ywv", 
"never underestimate power of stupid people in a democracy brexit", 
"a quick guide to brexit and beyond after britain votes to quit eu httpstcos1xkzrumvg httpstcocniutojkt0", 
"this selfinflicted wound will be his legacy cameron falls on sword after brexit euref httpstcoegph3qonbj httpstcohbyhxodeda", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit httpstcolobtyo48ng", 
"you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal httpstco", 
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", 
"i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"absolutely brilliant poll on brexit by yougov httpstcoepevg1moaw", 
"retweeted mikhail golub golub\r\n\r\nbrexit to be followed by grexit departugal italeave fruckoff czechout httpstcoavkpfesddz", 
"think the brexit campaign relies on the same sort of logic that drpepper does whats the worst that can happen thingsthatarewellbrexit", 
"am baffled by nigel farages claim that brexit is a victory for real people as if the 47 voting remain are fucking smu", 
"not one of the uks problems has been solved by brexit vote migration inequality the uks centurylong decline as", 
"scotland should never leave eu  calls for new independence vote grow httpstcorudiyvthia brexit", 
"the most articulate take on brexit is actually this ft reader comment today httpstco98b4dwsrtv", 
"65 million refugees half of them are children  maybe instead of fighting each other we should be working hand in hand ", 
"im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain", 
"life is too short to wear boring shoes  brexit")), .Names = c("Time", 
"clean_text"), row.names = c(NA, 20L), class = c("tbl_df", "tbl", 
"data.frame"))

Here is the code I used from Francisco's answer below:

library(stringr)

# Strip the trailing "*" from the wildcard keywords, leaving just the stems
for (x in 1:length(kw_Emo)) {
  if (grepl("[*]", kw_Emo[x])) {
    kw_Emo[x] <- substr(kw_Emo[x], 1, nchar(kw_Emo[x]) - 1)
  }
}

# Create one count column per keyword, initialized to 0
for (x in 1:length(kw_Emo)) {
  TestTweets[, kw_Emo[x]] <- 0
}

# Split each tweet into words and increment a keyword's column
# each time a word equals that keyword exactly
for (x in 1:nrow(TestTweets)) {
  partials <- data.frame(str_split(TestTweets[x, 2], " "), stringsAsFactors = FALSE)
  partials <- partials[partials[] != ""]
  for (y in 1:length(partials)) {
    for (z in 1:length(kw_Emo)) {
      if (kw_Emo[z] == partials[y]) {
        TestTweets[x, kw_Emo[z]] <- TestTweets[x, kw_Emo[z]] + 1
      }
    }
  }
}

Here is the output from Francisco's solution below (I renamed the new column EmoWordCount):

structure(list(Time = c("24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:03", "24/06/2016 10:55:03"
), clean_text = c("mayagoodfellow as always making sense of it all for us ive never felt less welcome in this country brexit  httpstcoiai5xa9ywv", 
"never underestimate power of stupid people in a democracy brexit", 
"a quick guide to brexit and beyond after britain votes to quit eu httpstcos1xkzrumvg httpstcocniutojkt0", 
"this selfinflicted wound will be his legacy cameron falls on sword after brexit euref httpstcoegph3qonbj httpstcohbyhxodeda", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit httpstcolobtyo48ng", 
"you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal httpstco", 
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", 
"i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"absolutely brilliant poll on brexit by yougov httpstcoepevg1moaw", 
"retweeted mikhail golub golub\r\n\r\n\r\n\r\nbrexit to be followed by grexit departugal italeave fruckoff czechout httpstcoavkpfesddz", 
"think the brexit campaign relies on the same sort of logic that drpepper does whats the worst that can happen thingsthatarewellbrexit", 
"am baffled by nigel farages claim that brexit is a victory for real people as if the 47 voting remain are fucking smu", 
"not one of the uks problems has been solved by brexit vote migration inequality the uks centurylong decline as", 
"scotland should never leave eu  calls for new independence vote grow httpstcorudiyvthia brexit", 
"the most articulate take on brexit is actually this ft reader comment today httpstco98b4dwsrtv", 
"65 million refugees half of them are children  maybe instead of fighting each other we should be working hand in hand ", 
"im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\n\r\nremain", 
"life is too short to wear boring shoes  brexit"), EmoWordCount = c(3, 
2, 0, 3, 5, 4, 3, 5, 7, 5, 2, 5, 11, 6, 6, 5, 1, 7, 6, 4)), .Names = c("Time", 
"clean_text", "EmoWordCount"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 2

Views: 176

Answers (2)

Francisco Ghelfi

Reputation: 962

I don't know if this is the optimal solution, but it works just fine. You should use the stringr package.

library(stringr)

for (x in 1:length(keywords)) {
  if (grepl("[*]", keywords[x])) {
    keywords[x] <- substr(keywords[x], 1, nchar(keywords[x]) - 1)
  }
}

Here I remove the "*" symbol from the keywords that contain it (the ones, as I understand it, whose partial inclusion in a string you want to analyze).

IMPORTANT:

You should use the regex expression [*] to catch the * symbol, because a bare * is a quantifier in regex, not a literal asterisk.
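
For what it's worth, the trailing asterisks can also be stripped without a loop (a vectorized equivalent of the loop above, not part of the original answer):

keywords <- sub("\\*$", "", keywords)  # drop a trailing "*" from each keyword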

for (x in 1:length(keywords)) {
  dataframe[, keywords[x]] <- 0
}

This just creates the new columns, with default values of 0.
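
With a base data.frame, the same thing can be done in one assignment (an equivalent shortcut, not part of the original answer):

dataframe[, keywords] <- 0  # creates all keyword columns at once, filled with 0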

for (x in 1:nrow(dataframe)) {
  partials <- data.frame(str_split(dataframe[x, 2], " "), stringsAsFactors = FALSE)
  partials <- partials[partials[] != ""]
  for (y in 1:length(partials)) {
    for (z in 1:length(keywords)) {
      if (keywords[z] == partials[y]) {
        dataframe[x, keywords[z]] <- dataframe[x, keywords[z]] + 1
      }
    }
  }
}

You split each tweet into a vector of words, check whether each keyword equals any of them, add 1 whenever it does, and end up with the same dataframe but with a new count column for each keyword.
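
If you want a single total per tweet, like the EmoWordCount column shown in the question, you can then sum the keyword columns (a sketch assuming the dataframe originally had two columns, so the keyword columns start at column 3):

dataframe$EmoWordCount <- rowSums(dataframe[, 3:ncol(dataframe)])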

I tested it with your keywords, and it gives the correct answer.

Upvotes: 1

Tim Biegeleisen

Reputation: 521914

Your requirement would seem to lend itself to a matrix-type output where, for example, the tweets are rows and each term is a column, with each cell holding the number of occurrences. Here is a base R solution using gsub:

terms <- c("cat", "hat", "bat")
tweets <- c("The cat in a hat met the man with the hat and a bat",
            "That cat was a fast cat!",
            "I bought a baseball bat while wearing a hat")

output <- sapply(terms, function(x) {
    sapply(tweets, function(y) {
        (nchar(y) - nchar(gsub(paste0("\\b", x, "\\b"), "", y))) / nchar(x)
    })
})

                                                    cat hat bat
The cat in a hat met the man with the hat and a bat   1   2   1
That cat was a fast cat!                              2   0   0
I bought a baseball bat while wearing a hat           0   1   1

This approach first iterates over each keyword in terms using sapply, and then iterates over each tweet. For each keyword/tweet combination, it computes the number of occurrences. The trick I used is to compare the length of the original tweet against the length of the same tweet with all occurrences of the keyword removed, with that difference then normalized by the length of the keyword.
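
If a stringr dependency is acceptable, str_count would produce the same matrix more directly (my alternative, not part of the original answer):

library(stringr)

# Count whole-word occurrences of each term in each tweet
output2 <- sapply(terms, function(x) str_count(tweets, paste0("\\b", x, "\\b")))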

Edit:

If you instead want a total sum of keyword occurrences for each tweet, then we can just call rowSums on the above matrix:

rowSums(output)

The cat in a hat met the man with the hat and a bat
                                                  4
                           That cat was a fast cat!
                                                  2
        I bought a baseball bat while wearing a hat
                                                  2
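
To apply this idea to the kw_Emo dictionary from the question, the trailing * wildcards would first need to be translated into a regex (a hedged sketch, assuming kw_Emo is still the raw keyword vector, before any stripping or collapsing, and that the tweets live in TestTweets$clean_text):

# Translate each wildcard entry into a regex, e.g. "abusi*" becomes the
# regex abusi\w*; plain entries are left unchanged
patterns <- sub("\\*$", "\\\\w*", kw_Emo)

counts <- sapply(patterns, function(p) {
    sapply(TestTweets$clean_text, function(y) {
        m <- gregexpr(paste0("\\b", p, "\\b"), y, perl = TRUE)[[1]]
        if (m[1] == -1) 0 else length(m)  # -1 means no match
    })
})

TestTweets$EmoWordCount <- rowSums(counts)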

Upvotes: 0
