user72716
user72716

Reputation: 273

Counting words and word stems in a large dataframe (RStudio)

I have a large dataframe consisting of tweets, and a keyword dictionary loaded as a list that has words and word stems associated with emotion (kw_Emo). I need to find a way to count how many times any given word/word stem from kw_Emo is present each tweet. In kw_Emo, word stems are marked with an asterisk ( * ). For example, one word stem is ador*, meaning that I need to account for the presence of adorable, adore, adoring, or any pattern of letters that starts with ador….


From a previous Stack Overflow discussion (see previous question on my profile), I was greatly helped with the following solution, but it only counts exact character matches (Ex. only ador, not adorable):

  1. Load relevant package.

    library(stringr)

  2. Identify and remove the * from word stems in kw_Emo.

    for (x in 1:length(kw_Emo)) { if (grepl("[*]", kw_Emo[x]) == TRUE) { kw_Emo[x] <- substr(kw_Emo[x],1,nchar(kw_Emo[x])-1) } }

  3. Create new columns, one for each word/word stem from kw_Emo, with default value 0.

    for (x in 1:length(keywords)) { dataframe[, keywords[x]] <- 0}

  4. Split each Tweet to a vector of words, see if the keyword is equal to any, add +1 to the appropriate word/word stems' column.

    for (x in 1:nrow(dataframe)) { partials <- data.frame(str_split(dataframe[x,2], " "), stringsAsFactors=FALSE) partials <- partials[partials[] != ""] for(y in 1:length(partials)) { for (z in 1:length(keywords)) { if (keywords[z] == partials[y]) { dataframe[x, keywords[z]] <- dataframe[x, keywords[z]] + 1 } } } }

Is there a way to alter this solution to account for word stems? I'm wondering if it's possible to first use a stringr pattern to replace occurrences of a word stem with the exact characters, and then use this exact match solution. For instance, something like stringr::str_replace_all(x, "ador[a-z]+", "ador"). But I'm unsure how to do this with my large dictionary and numerous word stems. Maybe the loop removing [*], which essentially identifies all word stems, can be adapted somehow?


Here is a reproducible sample of my dataframe, called TestTweets with the text to be analysed in a column called clean_text:

dput(droplevels(head(TestTweets, 20)))

structure(list(Time = c("24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:03", "24/06/2016 10:55:03"
), clean_text = c("mayagoodfellow as always making sense of it all for us ive never felt less welcome in this country brexit  httpstcoiai5xa9ywv", 
"never underestimate power of stupid people in a democracy brexit", 
"a quick guide to brexit and beyond after britain votes to quit eu httpstcos1xkzrumvg httpstcocniutojkt0", 
"this selfinflicted wound will be his legacy cameron falls on sword after brexit euref httpstcoegph3qonbj httpstcohbyhxodeda", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit httpstcolobtyo48ng", 
"you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal httpstco", 
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", 
"i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"absolutely brilliant poll on brexit by yougov httpstcoepevg1moaw", 
"retweeted mikhail golub golub\r\n\r\nbrexit to be followed by grexit departugal italeave fruckoff czechout httpstcoavkpfesddz", 
"think the brexit campaign relies on the same sort of logic that drpepper does whats the worst that can happen thingsthatarewellbrexit", 
"am baffled by nigel farages claim that brexit is a victory for real people as if the 47 voting remain are fucking smu", 
"not one of the uks problems has been solved by brexit vote migration inequality the uks centurylong decline as", 
"scotland should never leave eu  calls for new independence vote grow httpstcorudiyvthia brexit", 
"the most articulate take on brexit is actually this ft reader comment today httpstco98b4dwsrtv", 
"65 million refugees half of them are children  maybe instead of fighting each other we should be working hand in hand ", 
"im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain", 
"life is too short to wear boring shoes  brexit")), .Names = c("Time", 
"clean_text"), row.names = c(NA, 20L), class = c("tbl_df", "tbl", 
"data.frame"))

Here is kw_Emo:

kw_Emo <- c("abusi*", "accept", "accepta*", "accepted", "accepting", "accepts", "ache*", "aching", "active*", "admir*", "ador*", "advantag*", "adventur*", "advers*", "affection*", "afraid", "aggravat*", "aggress*", "agoniz*", "agony", "agree", "agreeab*", "agreed", "agreeing", "agreement*", "agrees", "alarm*", "alone", "alright*", "amaz*", "amor*", "amus*", "anger*", "angr*", "anguish*", "annoy*", "antagoni*", "anxi*", "aok", "apath*", "appall*", "appreciat*", "apprehens*", "argh*", "argu*", "arrogan*", "asham*", "assault*", "asshole*", "assur*", "attachment*", "attract*", "aversi*", "avoid*", "award*", "awesome", "awful", "awkward*", "bashful*", "bastard*", "battl*", "beaten", "beaut*", "beloved", "benefic*", "benevolen*", "benign*", "best", "better", "bitch*", "bitter*", "blam*", "bless*", "bold*", "bonus*", "bore*", "boring", "bother*", "brave*", "bright*", "brillian*", "broke", "burden*", "calm*", "cared", "carefree", "careful*", "careless*", "cares", "casual", "casually", "certain*", "challeng*", "champ*", "charit*", "charm*", "cheer*", "cherish*", "chuckl*", "clever*", "comed*", "comfort*", "commitment*", "complain*", "compliment*", "concerned", "confidence", "confident", "confidently", "confront*", "confus*", "considerate", "contempt*", "contented*", "contentment", "contradic*", "convinc*", "cool", "courag*", "crap", "crappy", "craz*", "create*", "creati*", "credit*", "cried", "cries", "critical", "critici*", "crude*", "cry", "crying", "cunt*", "cut", "cute*", "cutie*", "cynic", "danger*", "daring", "darlin*", "daze*", "dear*", "decay*", "defeat*", "defect*", "definite", "definitely", "degrad*", "delectabl*", "delicate*", "delicious*", "deligh*", "depress*", "depriv*", "despair*", "desperat*", "despis*", "destruct*", "determina*", "determined", "devastat*", "difficult*", "digni*", "disadvantage*", "disagree*", "disappoint*", "disaster*", "discomfort*", "discourag*", "dishearten*", "disillusion*", "dislike", "disliked", "dislikes", "disliking", "dismay*", "dissatisf*", "distract*", "distraught", "distress*", "distrust*", "disturb*", "divin*", "domina*", "doom*", "dork*", "doubt*", "dread*", "dull*", "dumb*", "dump*", "dwell*", "dynam*", "eager*", "ease*", "easie*", "easily", "easiness", "easing", "easy*", "ecsta*", "efficien*", "egotis*", "elegan*", "embarrass*", "emotion", "emotional", "empt*", "encourag*", "energ*", "engag*", "enjoy*", "enrag*", "entertain*", "enthus*", "envie*", "envious", "excel*", "excit*", "excruciat*", "exhaust*", "fab", "fabulous*", "fail*", "fake", "fantastic*", "fatal*", "fatigu*", "favor*", "favour*", "fear", "feared", "fearful*", "fearing", "fearless*", "fears", "feroc*", "festiv*", "feud*", "fiery", "fiesta*", "fine", "fired", "flatter*", "flawless*", "flexib*", "flirt*", "flunk*", "foe*", "fond", "fondly", "fondness", "fool*", "forgave", "forgiv*", "fought", "frantic*", "freak*", "free", "freeb*", "freed*", "freeing", "freely", "freeness", "freer", "frees*", "friend*", "fright*", "frustrat*", "fuck", "fucked*", "fucker*", "fuckin*", "fucks", "fume*", "fuming", "fun", "funn*", "furious*", "fury", "geek*", "genero*", "gentle", "gentler", "gentlest", "gently", "giggl*", "giver*", "giving", "glad", "gladly", "glamor*", "glamour*", "gloom*", "glori*", "glory", "goddam*", "gorgeous*", "gossip*", "grace", "graced", "graceful*", "graces", "graci*", "grand", "grande*", "gratef*", "grati*", "grave*", "great", "grief", "griev*", "grim*", "grin", "grinn*", "grins", "grouch*", "grr*", "guilt*", "ha", "haha*", "handsom*", "happi*", "happy", "harass*", "hated", "hateful*", "hater*", "hates", "hating", "hatred", "hazy", "heartbreak*", "heartbroke*", "heartfelt", "heartless*", "heartwarm*", "heh*", "hellish", "helper*", "helpful*", "helping", "helpless*", "helps", "hesita*", "hilarious", "hoho*", "homesick*", "honour*", "hope", "hoped", "hopeful", "hopefully", "hopefulness", "hopeless*", "hopes", "hoping", "horr*", "hostil*", "hug", "hugg*", "hugs", "humiliat*", "humor*", "humour*", "hurra*", "idiot", "ignor*", "impatien*", "impersonal", "impolite*", "importan*", "impress*", "improve*", "improving", "inadequa*", "incentive*", "indecis*", "ineffect*", "inferior*", "inhib*", "innocen*", "insecur*", "insincer*", "inspir*", "insult*", "intell*", "interest*", "interrup*", "intimidat*", "invigor*", "irrational*", "irrita*", "isolat*", "jaded", "jealous*", "jerk", "jerked", "jerks", "joke*", "joking", "joll*", "joy*", "keen*", "kidding", "kind", "kindly", "kindn*", "kiss*", "laidback", "lame*", "laugh*", "lazie*", "lazy", "liabilit*", "libert*", "lied", "lies", "like", "likeab*", "liked", "likes", "liking", "livel*", "LMAO", "LOL", "lone*", "longing*", "lose", "loser*", "loses", "losing", "loss*", "lost", "lous*", "love", "loved", "lovely", "lover*", "loves", "loving*", "low*", "luck", "lucked", "lucki*", "luckless*", "lucks", "lucky", "ludicrous*", "lying", "mad", "maddening", "madder", "maddest", "madly", "magnific*", "maniac*", "masochis*", "melanchol*", "merit*", "merr*", "mess", "messy", "miser*", "miss", "missed", "misses", "missing", "mistak*", "mock", "mocked", "mocker*", "mocking", "mocks", "molest*", "mooch*", "mood", "moodi*", "moods", "moody", "moron*", "mourn*", "nag*", "nast*", "neat*", "needy", "neglect*", "nerd*", "nervous*", "neurotic*", "nice*", "numb*", "nurtur*", "obnoxious*", "obsess*", "offence*", "offens*", "ok", "okay", "okays", "oks", "openminded*", "openness", "opportun*", "optimal*", "optimi*", "original", "outgoing", "outrag*", "overwhelm*", "pained", "painf*", "paining", "painl*", "pains", "palatabl*", "panic*", "paradise", "paranoi*", "partie*", "party*", "passion*", "pathetic*", "peculiar*", "perfect*", "personal", "perver*", "pessimis*", "petrif*", "pettie*", "petty*", "phobi*", "piss*", "piti*", "pity*", "play", "played", "playful*", "playing", "plays", "pleasant*", "please*", "pleasing", "pleasur*", "poison*", "popular*", "positiv*", "prais*", "precious*", "pressur*", "prettie*", "pretty", "prick*", "pride", "privileg*", "prize*", "problem*", "profit*", "promis*", "protested", "protesting", "proud*", "puk*", "radian*", "rage*", "raging", "rancid*", "rape*", "raping", "rapist*", "readiness", "ready", "reassur*", "reek*", "regret*", "reject*", "relax*", "relief", "reliev*", "reluctan*", "remorse*", "repress*", "resent*", "resign*", "resolv*", "restless*", "revigor*", "reward*", "rich*", "ridicul*", "rigid*", "risk*", "ROFL", "romanc*", "romantic*", "rotten", "rude*", "sad", "sadde*", "sadly", "sadness", "sarcas*", "satisf*", "savage*", "scare*", "scaring", "scary", "sceptic*", "scream*", "screw*", "selfish*", "sentimental*", "serious", "seriously", "seriousness", "severe*", "shake*", "shaki*", "shaky", "share", "shared", "shares", "sharing", "shit*", "shock*", "shook", "shy*", "sigh", "sighed", "sighing", "sighs", "silli*", "silly", "sincer*", "skeptic*", "smart*", "smil*", "smother*", "smug*", "snob*", "sob", "sobbed", "sobbing", "sobs", "sociab*", "solemn*", "sorrow*", "sorry", "soulmate*", "special", "splend*", "stammer*", "stank", "startl*", "stink*", "strain*", "strange", "strength*", "stress*", "strong*", "struggl*", "stubborn*", "stunk", "stunned", "stuns", "stupid*", "stutter*", "succeed*", "success*", "suck", "sucked", "sucker*", "sucks", "sucky", "sunnier", "sunniest", "sunny", "sunshin*", "super", "superior*", "support", "supported", "supporter*", "supporting", "supportive*", "supports", "suprem*", "sure*", "surpris*", "suspicio*", "sweet", "sweetheart*", "sweetie*", "sweetly", "sweetness*", "sweets", "talent*", "tantrum*", "tears", "teas*", "tehe", "temper", "tempers", "tender*", "tense*", "tensing", "tension*", "terribl*", "terrific*", "terrified", "terrifies", "terrify", "terrifying", "terror*", "thank", "thanked", "thankf*", "thanks", "thief", "thieve*", "thoughtful*", "threat*", "thrill*", "ticked", "timid*", "toleran*", "tortur*", "tough*", "traged*", "tragic*", "tranquil*", "trauma*", "treasur*", "treat", "trembl*", "trick*", "trite", "triumph*", "trivi*", "troubl*", "TRUE", "trueness", "truer", "truest", "truly", "trust*", "truth*", "turmoil", "ugh", "ugl*", "unattractive", "uncertain*", "uncomfortabl*", "uncontrol*", "uneas*", "unfortunate*", "unfriendly", "ungrateful*", "unhapp*", "unimportant", "unimpress*", "unkind", "unlov*", "unpleasant", "unprotected", "unsavo*", "unsuccessful*", "unsure*", "unwelcom*", "upset*", "uptight*", "useful*", "useless*", "vain", "valuabl*", "valuing", "vanity", "vicious*", "vigor*", "vigour*", "villain*", "violat*", "virtuo*", "vital*", "vulnerab*", "vulture*", "warfare*", "warm*", "warred", "weak*", "wealth*", "weapon*", "weep*", "weird*", "welcom*", "well*", "wept", "whine*", "whining", "willing", "wimp*", "win", "winn*", "wins", "wisdom", "wise*", "witch", "woe*", "won", "wonderf*", "worr*", "worse*", "worship*", "worst", "wow*", "yay", "yays","yearn*","stench*")


Code used from MRau's answer that hasn't worked for me:

ind_stem <- grep("[*]", kw_Emo)
kw_stem  <- gsub("[*]", "", kw_Emo[ind_stem])
kw_word  <- kw_Emo[-ind_stem]
tweets <- strsplit(TestTweets[, "clean_text"], "\\s+")

for (kws in kw_stem) {
  count_i <- unlist(lapply(tweets, function(x) length(grep(kws, x))))
  TestTweets <- cbind(TestTweets, count_i)
  colnames(TestTweets)[ncol(TestTweets)] <- paste0(kws, "*")
}
for (kww in kw_word) {
  count_i <- unlist(lapply(tweets, function(x) length(grep(paste0("^", kww, "$"), x))))
  TestTweets <- cbind(TestTweets, count_i)
  colnames(TestTweets)[ncol(TestTweets)] <- kww
}

Upvotes: 0

Views: 390

Answers (1)

MRau
MRau

Reputation: 336

So first of all I would get rid of some of the for loops:

ind_stem <- grep("[*]", kw_Emo)
kw_stem  <- gsub("[*]", "", kw_Emo[ind_stem])
kw_word  <- kw_Emo[-ind_stem]
tweets <- strsplit(TestTweets[, "clean_text"], "\\s+")

I generated a different vector for words and stems. tweets is a list of vector of words - strsplit splits the strings using blank space (\\s+) as separator.

When it comes to the matching of words/stems, you can use grep for both. By default, it will find all the words containing the given pattern:

> grep("Abc", c("Abc", "Abcdef"))
[1] 1 2

But you can get the "exact" match if you use ^ and $:

> grep("^Abc$", c("Abc", "Abcdef"))
[1] 1

In you code, you want to look at the length of the grep output and e.g. append it to your data.frame:

for (kws in kw_stem) {
    count_i <- unlist(lapply(tweets, function(x) length(grep(kws, x))))
    TestTweets <- cbind(TestTweets, count_i)
    colnames(TestTweets)[ncol(TestTweets)] <- paste0(kws, "*")
}
for (kww in kw_word) {
    count_i <- unlist(lapply(tweets, function(x) length(grep(paste0("^", kww, "$"), x))))
    TestTweets <- cbind(TestTweets, count_i)
    colnames(TestTweets)[ncol(TestTweets)] <- kww
}

Fragment of output:

> TestTweets[19:20, c("clean_text", "boring")]
                                                                                                                    clean_text boring
19 im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain      0
20                                                                           life is too short to wear boring shoes  brexit      1

Of course, you can further optimize this code or decide whether or not use grep(paste0("^", kws), x) instead of grep(kws, x) in the first loop depending on your problem etc.

Upvotes: 1

Related Questions