Reputation: 684
In the past, I received help building a tf-idf matrix for one of my documents and got the output I wanted (please see below).
TagSet <- data.frame(emoticon = c("🤔","🍺","💪","🥓","😃"),
stringsAsFactors = FALSE)
TextSet <- data.frame(tweet = c("🤔Sharp, adversarial⚔️~pro choice💪~ban Pit Bulls☠️~BSL🕊️~aberant psychology😈~common sense🤔~the Piper will lead us to reason🎵~sealskin woman🐺",
"Blocked by Owen, Adonis. Abbott & many #FBPE😃 Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement 🇬🇧",
"🇺🇸🇺🇸🇺🇸🇺🇸 #healthy #vegetarian #beatchronicillness fix infrastructure",
"LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
"💙🖤I #BackTheBlue for my son!🖤💙 Facts Over Feelings. Border Security saves lives! #ThankYouICE",
"🤔🇺🇸🇺🇸 I play Pedal Steel @CooderGraw & #CharlieShafter🇺🇸🇺🇸 #GoStars #LiberalismIsAMentalDisorder",
"#Englishman #Londoner @Chelseafc 🕵️♂️ 🥓🚁 🍺 🏴🇬🇧🇨🇿",
"F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
"🌸🐦❄️Do not dwell in tbaconhe past, do not dream of the future, concentrate the mind on the present moment.🌸🐿️❄️",
"Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro 🇸🇪 | 👋🏼Hello intro on the Minds Link |"),
stringsAsFactors = FALSE)
library(dplyr)
library(quanteda)
tweets_dfm <- dfm(tokens(TextSet$tweet)) # tokenize, then convert to document-feature matrix (dfm() on raw text is deprecated in quanteda >= 3)
tweets_dfm %>%
dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
dfm_tfidf() %>% # weight with tfidf
convert("data.frame") # turn into data.frame to display more easily
# document 🤔 💪 😃 🥓 🍺
# 1 text1 1.39794 1 0 0 0
# 2 text2 0.00000 0 1 0 0
# 3 text3 0.00000 0 0 0 0
# 4 text4 0.00000 0 0 0 0
# 5 text5 0.00000 0 0 0 0
# 6 text6 0.69897 0 0 0 0
# 7 text7 0.00000 0 0 1 1
# 8 text8 0.00000 0 0 0 0
# 9 text9 0.00000 0 0 0 0
# 10 text10 0.00000 0 0 0 0
But I need a little help with calculating tf-idf per single term. That is, how do I get one accurate tf-idf value for each term from the matrix?
# terms tfidf
# 🤔 #its tfidf the correct way
# 🍺 #its tfidf the correct way
# 💪 #its tfidf the correct way
# 🥓 #its tfidf the correct way
# 😃 #its tfidf the correct way
I am fairly sure it is not as simple as summing a term's tf-idf values down its matrix column, dividing by the number of documents in which it appears, and treating the result as the value for that term.
I have looked at a few sources such as here, https://stats.stackexchange.com/questions/422750/how-to-calculate-tf-idf-for-a-single-term, but as far as I can tell the author is asking something else entirely.
I am currently weak in text-mining/analysis terminology.
Upvotes: 1
Views: 545
Reputation: 14902
In short, you cannot compute a tf-idf value for each feature, isolated from its document context, because each tf-idf value for a feature is specific to a document.
More specifically:
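A sketch of the usual definition, assuming the defaults of dfm_tfidf() (raw counts for term frequency, base-10 log of the inverse document proportion). The subscript notation is an assumption for illustration: i indexes documents, j indexes features, N is the number of documents, and df_j is the number of documents containing feature j.

```latex
\mathrm{tfidf}_{ij} = \mathrm{tf}_{ij} \times \mathrm{idf}_{j},
\qquad
\mathrm{idf}_{j} = \log_{10}\frac{N}{\mathrm{df}_{j}}
```

Only the idf part depends on the feature alone. The tf part carries a document index, so the product does too. For 🤔 in the example, N = 10 and df = 2, so idf = log10(5) ≈ 0.69897, and text1 (where it occurs twice) gets 2 × 0.69897 = 1.39794.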
You can see this in your example:
> tweets_dfm %>%
+ dfm_tfidf() %>%
+ dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
+ as.matrix()
features
docs \U0001f914 \U0001f4aa \U0001f603 \U0001f953 \U0001f37a
text1 1.39794 1 0 0 0
text2 0.00000 0 1 0 0
text3 0.00000 0 0 0 0
text4 0.00000 0 0 0 0
text5 0.00000 0 0 0 0
text6 0.69897 0 0 0 0
text7 0.00000 0 0 1 1
text8 0.00000 0 0 0 0
text9 0.00000 0 0 0 0
text10 0.00000 0 0 0 0
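To check where those numbers come from, here is a minimal base R sketch for the 🤔 column only. It assumes the dfm_tfidf() defaults (raw counts, base-10 log), and the counts are read off the example tweets by hand rather than taken from quanteda.

```r
# Counts of the thinking-face emoji in each of the ten tweets,
# read off the example text (two in text1, one in text6).
tf <- c(text1 = 2, text2 = 0, text3 = 0, text4 = 0, text5 = 0,
        text6 = 1, text7 = 0, text8 = 0, text9 = 0, text10 = 0)

n_docs   <- length(tf)             # N = 10 documents
doc_freq <- sum(tf > 0)            # documents containing the feature = 2
idf      <- log10(n_docs / doc_freq)  # a single value for the feature

tfidf <- tf * idf                  # one value per document
round(tfidf[c("text1", "text6")], 5)
# text1 = 1.39794, text6 = 0.69897
```

The feature has one idf but as many tf-idf values as there are documents, which is why no single "tf-idf of 🤔" exists.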
Two more things:
Averaging by feature does not really make sense, because the inverse document frequency is already a kind of average: it is the inverse of the proportion of documents in which a term occurs. Furthermore, it is usually logged, so it would need to be transformed back before you could average it.
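If a single number per term is what you are after, the inverse document frequency is the quantity that is defined per feature. A minimal base R sketch, assuming the base-10 log scheme; the counts are read off the example tweets, and the column names are descriptive stand-ins for the emoji:

```r
# Raw counts for the five emoticons in the ten example tweets.
counts <- matrix(0, nrow = 10, ncol = 5,
                 dimnames = list(paste0("text", 1:10),
                                 c("thinking", "flexed_biceps", "grinning",
                                   "bacon", "beer")))
counts["text1", c("thinking", "flexed_biceps")] <- c(2, 1)
counts["text2", "grinning"] <- 1
counts["text6", "thinking"] <- 1
counts["text7", c("bacon", "beer")] <- 1

doc_freq <- colSums(counts > 0)            # documents containing each feature
idf <- log10(nrow(counts) / doc_freq)      # one value per feature
round(idf, 5)

# Undoing the log and the inversion recovers the plain document proportion.
prop_docs <- 10^(-idf)
```

In quanteda itself, docfreq(tweets_dfm, scheme = "inverse") should return the same kind of per-feature vector, though I have not checked it against this output.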
Above, I computed the tf-idf before removing the other features, because removing features first would change the term frequencies if you use relative ("normalized") term frequencies. dfm_tfidf() uses term counts by default, so the results here are unaffected.
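The order only matters with relative frequencies, which is what scheme_tf = "prop" would give in dfm_tfidf(). A small base R illustration with hypothetical counts for one document (the feature names a, b, c and their counts are made up for the example):

```r
# Hypothetical token counts for a single document.
counts <- c(a = 2, b = 1, c = 7)
keep   <- c("a", "b")                  # the features we want to retain

# Weight first, then select: proportions use the full document length.
prop_then_select <- (counts / sum(counts))[keep]

# Select first, then weight: the denominator shrinks, so the values change.
select_then_prop <- counts[keep] / sum(counts[keep])

prop_then_select   # a = 0.2, b = 0.1
select_then_prop   # a = 2/3, b = 1/3

# With raw counts the order makes no difference.
identical(counts[keep], counts[keep])
```

Weighting before selecting keeps the term frequencies relative to the whole tweet, which is usually what is intended.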
Upvotes: 3