Ja123

Reputation: 63

R: find the words from tweets that are in a lexicon, count them and save the count in the data frame with the tweets

I have a data set of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also created my own lexicon (formal_lexicon) of around 1 million words, all in a formal language style. Now I want to write a small piece of code that, for each tweet, counts how many of its words (if any) also appear in that lexicon.

tweets_data:

                   Content            
1                 "Blablabla"               
2                 "Hi my name is"               
3                 "Yes I need"                 
.  
.
. 
50176            "TEXT50176" 

formal_lexicon:

                       X            
1                 "admittedly"               
2                 "Consequently"               
3                 "Furthermore"                 
.  
.
. 
1000000            "meanwhile"   

The output should thus look like:

                  Content             Lexicon
1                 "TEXT1"                1
2                 "TEXT2"                3
3                 "TEXT3"                0 
.  
.
. 
50176            "TEXT50176"             2

I imagine it should be a simple for loop, something like:

for (sentence in tweets_data$Content) {
  for (word in sentence) {
    if (word %in% formal_lexicon) {
      ...
    }
  }
}

I don't think "word" works as a way of iterating over the words of a sentence, and I'm not sure how to count, for each tweet, how many of its words are in the lexicon and store that number in a column. Can anyone help?

A sample of formal_lexicon (via dput):

structure(list(X = c("admittedly", "consequently", "conversely",  "considerably", "essentially", "furthermore")), row.names = c(NA,  6L), class = "data.frame")

c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",  "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",  "Damn, it's hard to wrap presents when you're drunk. cc @santa",  "When my whole fam tryna have a peaceful holiday " )

Upvotes: 0

Views: 210

Answers (2)

s__

Reputation: 9495

You can try something like this:

library(tidytext)
library(dplyr)

# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely",  "considerably", "essentially", "furthermore")), row.names = c(NA,  6L), class = "data.frame")
tweets_data <- c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",  "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",  "Damn, it's hard to wrap presents when you're drunk. cc @santa",  "When my whole fam tryna have a peaceful holiday " )

# put your tweets in a data.frame with an id
tweets_data_df <- data.frame(Content = tweets_data, id = seq_along(tweets_data))

tweets_data_df %>%
  # split each tweet into one word per row
  unnest_tokens(txt, Content) %>%
  # flag with 1/0 whether the word is in the lexicon (so tweets with no match keep a 0)
  mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
  # count the matches per tweet
  group_by(id) %>%
  summarise(cnt = sum(pres)) %>%
  # put the texts back
  left_join(tweets_data_df) %>%
  # reorder the columns
  select(id, Content, cnt)

With result:

Joining, by = "id"
# A tibble: 6 x 3
     id Content                                                              cnt
  <int> <chr>                                                              <dbl>
1     1 "@barackobama Thank you for your incredible grace in leadership a~     0
2     2 "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles Co~     0
3     3 "2017 resolution: to embody authenticity!"                             0
4     4 "Happy Holidays! Sending love and light to every corner of the ea~     0
5     5 "Damn, it's hard to wrap presents when you're drunk. cc @santa"        0
6     6 "When my whole fam tryna have a peaceful holiday "                     0

Upvotes: 1

Johan Rosa

Reputation: 3152

Hope this is useful for you:

library(magrittr)
library(dplyr)
library(tidytext)

# Data frame with tweets, including an ID
tweets <- data.frame(
  id = 1:3,
  text = c(
    'Hello, this is the first tweet example to your answer',
    'I hope that my response help you to do your task',
    'If it is the case, please upvote and mark as the correct answer'
  )
)

lexicon <- data.frame(
  word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)


# Counting the words in each tweet that are present in your lexicon
in_lexicon <- tweets %>%
# Separate every word in your tweets into its own row
  tidytext::unnest_tokens(output = 'words', input = text) %>% 
# Determining if a word is in your lexicon
  dplyr::mutate(
    in_lexicon = words %in% lexicon$word
  ) %>% 
  dplyr::group_by(id) %>%
  dplyr::summarise(words_in_lexicon = sum(in_lexicon))

# Joining the counts back onto the original data
dplyr::left_join(tweets, in_lexicon)
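
If you join explicitly by id you also avoid the "Joining, by" message; with the three example tweets above, each row should end up with words_in_lexicon = 2 (hello/first, response/task, upvote/correct):

# join the counts back onto the original tweets, matching on id
dplyr::left_join(tweets, in_lexicon, by = "id")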

Upvotes: 0
