Reputation: 1769
I have a data frame x that is:
> str(x)
'data.frame': 117654 obs. of 2 variables:
$ text : chr "more about " ...
$ doc_id: chr "Text 1" "Text 2" "Text 3" "Text 4" ...
I can't post it here with dput because it is too large. I'm trying to compute the TF-IDF, and I wrote this code:
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- x %>%
mutate(text = as.character(text)) %>%
unnest_tokens(output = word, input = text) %>%
count(doc_id, word, sort = TRUE)
book_words <- book_words %>%
bind_tf_idf(term = word, document = doc_id, n)
book_words <- book_words[order(book_words$tf_idf, decreasing = FALSE), ]
book_words <- book_words[!duplicated(book_words$word), ]
Anyway, I noticed that some words appear to be truncated in book_words. For example:
           doc_id          word n  tf      idf    tf_idf
792727 Text 33268     disposabl 1 1.0 11.67321 11.673214
I'm sure this is a truncated term because, if I run:
x[grepl("^disposabl$",x$text),]
I obtain no rows.
Has this ever happened to you?
Upvotes: 0
Views: 70
Reputation: 160437
From your output, it appears that there is leading blank space in the word. If it were just "disposabl" with no leading/trailing blanks, I would expect
           doc_id      word n tf      idf   tf_idf
792727 Text 33268 disposabl 1  1 11.67321 11.67321
###              ^         ^ one space each
but your output shows
           doc_id          word n  tf      idf    tf_idf
792727 Text 33268     disposabl 1 1.0 11.67321 11.673214
                  ^^^^ four extra blanks
This means that your "^disposabl$" is too restrictive. Try filtering with:
x[grepl("disposabl$",x$text),]
This removes the leading ^ and therefore allows something before the d. Alternatives:
"\\bdisposabl$" adds a word boundary, so "adisposabl" won't match but "a disposabl" will still match;
"^\\s*disposabl$" requires that anything leading be blank space;
x[grepl("^disposabl$",trimws(x$text)),], where your original pattern would have worked.
Upvotes: 2
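To make the difference between these patterns concrete, here is a minimal sketch on a made-up four-row data frame. The strings are hypothetical stand-ins, not values from the asker's data, and the names patterns, matches and trimmed_hits are only for illustration. It applies each pattern from the answer to the same column, and it uses dput() as one way to see padding that a printed table can hide.

```r
# Toy data standing in for the real `x`; these strings are invented for illustration.
x <- data.frame(
  text = c("disposabl", "    disposabl", "a disposabl", "adisposabl"),
  doc_id = paste("Text", 1:4),
  stringsAsFactors = FALSE
)

# The patterns discussed above, applied to the same column.
patterns <- c(
  strict      = "^disposabl$",     # original: nothing may precede the d
  no_anchor   = "disposabl$",      # anything may precede the word
  boundary    = "\\bdisposabl$",   # the word must start at a word boundary
  lead_blanks = "^\\s*disposabl$"  # only whitespace may precede the word
)

# One logical column per pattern, one row per input string.
matches <- sapply(patterns, grepl, x = x$text)
rownames(matches) <- x$text
print(matches)

# Trimming first lets the original strict pattern match padded values.
trimmed_hits <- x[grepl("^disposabl$", trimws(x$text)), ]
print(trimmed_hits)

# dput() prints the exact stored string, including any leading blanks.
dput(x$text[2])
```

Running the same comparison on a handful of suspect rows from the real data should show which pattern, if any, picks them up.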