Why does tf-idf truncate words?

Question

I have a dataframe x that is:

> str(x)
'data.frame':   117654 obs. of  2 variables:
$ text  : chr  "more about " ...
$ doc_id: chr  "Text 1" "Text 2" "Text 3" "Text 4" ...

I can't share it here with dput because it is too large. I'm trying to compute the TF-IDF, and I wrote this code:

library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- x %>%
  mutate(text = as.character(text)) %>% 
  unnest_tokens(output = word, input = text) %>%
  count(doc_id, word, sort = TRUE)

book_words <- book_words %>%
  bind_tf_idf(term = word, document = doc_id, n)

book_words<-book_words[order(book_words$tf_idf,decreasing=FALSE),]
book_words = book_words[!duplicated(book_words$word),]

Anyway, I noticed that some words appear to be truncated in book_words. For example:

             doc_id          word n  tf      idf    tf_idf
 792727  Text 33268     disposabl 1 1.0 11.67321 11.673214

I'm sure this is a truncated term because, if I run:

x[grepl("^disposabl$",x$text),]

I obtain no rows.

Has this ever happened to anyone else?

r2evans · Accepted Answer

From your output, it appears that there is leading whitespace in the word. If it were just "disposabl" with no leading/trailing blanks, I would expect

            doc_id      word n tf      idf   tf_idf
 792727 Text 33268 disposabl 1  1 11.67321 11.67321
 ###              ^         ^   one space each

but your output shows

             doc_id          word n  tf      idf    tf_idf
 792727  Text 33268     disposabl 1 1.0 11.67321 11.673214
                    ^^^^  four extra blanks

This means that your "^disposabl$" is too restrictive. Try filtering with:

x[grepl("disposabl$",x$text),]

removing the leading ^ and therefore allowing something before the d. Alternatives:

"\\bdisposabl$" adds a word boundary, so "adisposabl" won't match but "a disposabl" will still match;
"^\\s*disposabl$" requires that anything before the word be whitespace;
trim the whitespace with x[grepl("^disposabl$",trimws(x$text)),], where your original pattern would have worked.
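
To see how these patterns differ, here is a minimal sketch on a made-up character vector. The `text` vector is hypothetical toy data, not taken from the question's data frame, and only base R is used. Note that backslashes have to be doubled inside R string literals.

```r
# Hypothetical toy data standing in for x$text
text <- c("disposabl", "    disposabl", "adisposabl", "a disposabl", "disposable")

# Original pattern: the whole string must be exactly "disposabl"
exact <- grepl("^disposabl$", text)
# TRUE FALSE FALSE FALSE FALSE

# No leading anchor: anything may precede the word
suffix <- grepl("disposabl$", text)
# TRUE TRUE TRUE TRUE FALSE

# Word boundary: rejects "adisposabl", keeps "a disposabl"
boundary <- grepl("\\bdisposabl$", text)
# TRUE TRUE FALSE TRUE FALSE

# Only whitespace is allowed before the word
blanks <- grepl("^\\s*disposabl$", text)
# TRUE TRUE FALSE FALSE FALSE

# Trim first, then apply the original pattern
trimmed <- grepl("^disposabl$", trimws(text))
# TRUE TRUE FALSE FALSE FALSE

# Entries that carry leading or trailing whitespace
padded <- text[text != trimws(text)]
# "    disposabl"
```

Which variant fits depends on what is allowed to surround the word in your data. Comparing a column against its `trimws()` version, as in the last line, is a quick way to check whether stray whitespace is present at all.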
