Reputation: 65
In text2vec
package, I am using create_vocabulary function. For eg:
My text is "This book is very good" and suppose I am not using stopwords and an ngram of 1L to 3L. so the vocab terms will be
This, book, is, very, good, This book,..... book is very, very good. I just want to remove the term "book is very" (and host of other terms using a vector). Since I just want to remove a phrase I cant use stopwords. I have coded the below code:
vocab<-create_vocabulary(it,ngram=c(1L,3L))
vocab_mod<- subset(vocab,!(term %in% stp) # where stp is stop phrases.
x<- read.csv(Filename') #these are all stop phrases
stp<-as.vector(x$term)
When I do the above step, the metainformation in attributes get lost in vocab_mod and so can't be used in create_dtm
.
Upvotes: 1
Views: 424
Reputation: 4595
It seems that subset
function drops some attributes. You can try:
library(text2vec)
txt = "This book is very good"
it = itoken(txt)
v = create_vocabulary(it, ngram = c(1, 3))
v = v[!(v$term %in% "is_very_good"), ]
v
# Number of docs: 1
# 0 stopwords: ...
# ngram_min = 1; ngram_max = 3
# Vocabulary:
# term term_count doc_count
# 1: good 1 1
# 2: book_is_very 1 1
# 3: This_book 1 1
# 4: This 1 1
# 5: book 1 1
# 6: very_good 1 1
# 7: is_very 1 1
# 8: book_is 1 1
# 9: This_book_is 1 1
# 10: is 1 1
# 11: very 1 1
dtm = create_dtm(it, vocab_vectorizer(v))
Upvotes: 1
Reputation: 65
@Dmitriy even this lets to drop the attributes... So the way out that I found was just adding the attributes manually for now using attr function
attr(vocab_mod,"ngram")<-c(ngram_min = 1L,ngram_max=3L) and son one for other attributes as well. We can get attribute details from vocab.
Upvotes: 0