CaseebRamos

Reputation: 684

How to calculate tf-idf for a single term after getting the tf-idf matrix?

In the past, I received help building a tf-idf matrix for one of my documents and got the output I wanted (please see below).

TagSet <- data.frame(emoticon = c("🤔","🍺","💪","🥓","😃"),
                     stringsAsFactors = FALSE)

TextSet <- data.frame(tweet = c("🤔Sharp, adversarial⚔️~pro choice💪~ban Pit Bulls☠️~BSL🕊️~aberant psychology😈~common sense🤔~the Piper will lead us to reason🎵~sealskin woman🐺",
                                "Blocked by Owen, Adonis. Abbott & many #FBPE😃 Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement 🇬🇧",
                                "🇺🇸🇺🇸🇺🇸🇺🇸 #healthy #vegetarian #beatchronicillness fix infrastructure",
                                "LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
                                "💙🖤I #BackTheBlue for my son!🖤💙 Facts Over Feelings. Border Security saves lives! #ThankYouICE",
                                "🤔🇺🇸🇺🇸 I play Pedal Steel @CooderGraw & #CharlieShafter🇺🇸🇺🇸 #GoStars #LiberalismIsAMentalDisorder",
                                "#Englishman  #Londoner  @Chelseafc  🕵️‍♂️ 🥓🚁 🍺 🏴󠁧󠁢󠁥󠁮󠁧󠁿🇬🇧🇨🇿",
                                "F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
                                "🌸🐦❄️Do not dwell in tbaconhe past, do not dream of the future, concentrate the mind on the present moment.🌸🐿️❄️",
                                "Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro 🇸🇪 | 👋🏼Hello intro on the Minds Link |"),
                      stringsAsFactors = FALSE)


library(dplyr)
library(quanteda)

tweets_dfm <- dfm(tokens(TextSet$tweet))  # tokenize, then convert to a document-feature matrix

tweets_dfm %>% 
  dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
  dfm_tfidf() %>%                 # weight with tfidf
  convert("data.frame")           # turn into data.frame to display more easily

#     document       🤔             🍺           💪          🥓           😃
# 1     text1      1.39794            1            0            0            0
# 2     text2      0.00000            0            1            0            0
# 3     text3      0.00000            0            0            0            0
# 4     text4      0.00000            0            0            0            0
# 5     text5      0.00000            0            0            0            0
# 6     text6      0.69897            0            0            0            0
# 7     text7      0.00000            0            0            1            1
# 8     text8      0.00000            0            0            0            0
# 9     text9      0.00000            0            0            0            0
# 10   text10      0.00000            0            0            0            0

But I need a little help with calculating tf-idf for a single term. That is, how do I accurately get one tf-idf value for each term from the matrix?

# terms      tfidf
# 🤔      # its tf-idf, computed the correct way
# 🍺      # its tf-idf, computed the correct way
# 💪      # its tf-idf, computed the correct way
# 🥓      # its tf-idf, computed the correct way
# 😃      # its tf-idf, computed the correct way

I am sure it is not as simple as adding up all of a term's tf-idf values from its matrix column, dividing by the number of documents in which it appears, and calling that the value for the term.

I have looked at a few sources, such as https://stats.stackexchange.com/questions/422750/how-to-calculate-tf-idf-for-a-single-term, but as far as I can tell the author there is asking something else entirely.

I am currently weak in text-mining/analysis terminology.

Upvotes: 1

Views: 545

Answers (1)

Ken Benoit

Reputation: 14902

In short, you cannot compute a tf-idf value for each feature, isolated from its document context, because each tf-idf value for a feature is specific to a document.

More specifically:

  • (inverse) document frequency is one value per feature, so indexed by $j$
  • term frequency is one value per term per document, so indexed by $i,j$
  • tf-idf is therefore also indexed by $i,j$
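These three quantities can be inspected directly in quanteda (a minimal sketch on toy data, not the tweets above; `docfreq()` and `dfm_tfidf()` are the relevant functions):

```r
library(quanteda)

toks <- tokens(c(d1 = "a a b", d2 = "a c"))
m <- dfm(toks)

docfreq(m)                      # document frequency: one value per feature j
docfreq(m, scheme = "inverse")  # idf = log10(N / df_j): still one value per j
as.matrix(dfm_tfidf(m))         # tf-idf: one value per (document, feature) pair i,j
```

Note that the only per-feature (single-number-per-term) quantity here is the document frequency and its inverse; the moment term frequency enters, the result is indexed by document as well.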

You can see this in your example:

> tweets_dfm %>% 
+   dfm_tfidf() %>%
+   dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
+   as.matrix()
        features
docs     \U0001f914 \U0001f4aa \U0001f603 \U0001f953 \U0001f37a
  text1     1.39794          1          0          0          0
  text2     0.00000          0          1          0          0
  text3     0.00000          0          0          0          0
  text4     0.00000          0          0          0          0
  text5     0.00000          0          0          0          0
  text6     0.69897          0          0          0          0
  text7     0.00000          0          0          1          1
  text8     0.00000          0          0          0          0
  text9     0.00000          0          0          0          0
  text10    0.00000          0          0          0          0

Two more things:

  1. Averaging by feature does not really make sense, because the inverse document frequency is already a kind of aggregate: it is (the log of) the inverse proportion of documents in which a term occurs. Moreover, since it is usually logged, you would need to transform it back before averaging.

  2. Above, I computed the tf-idf before removing the other features, because removing features first would redefine term frequency if you used relative ("normalized") term frequencies. dfm_tfidf() uses raw term counts by default, so the results here are unaffected by the ordering.
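The ordering point in (2) can be demonstrated with relative frequencies (a sketch on toy data; `scheme_tf = "prop"` switches dfm_tfidf() to normalized term frequencies):

```r
library(quanteda)

toks <- tokens(c(d1 = "a a b b b", d2 = "a c"))
m <- dfm(toks)

# tf-idf first, then select: tf for "b" is 3/5 of d1's tokens
tfidf_then_select <- dfm_select(dfm_tfidf(m, scheme_tf = "prop"), "b")

# select first, then tf-idf: "b" is now all of d1's remaining tokens, so tf = 1
select_then_tfidf <- dfm_tfidf(dfm_select(m, "b"), scheme_tf = "prop")

as.matrix(tfidf_then_select)  # d1: 0.6 * log10(2)
as.matrix(select_then_tfidf)  # d1: 1.0 * log10(2)
```

With the default `scheme_tf = "count"` the two orderings agree, which is why the answer's pipeline and the question's pipeline produce the same numbers.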

Upvotes: 3
