Reputation: 391
I am working with a data frame that contains one document per row: a document number and its text. The data was exported from an XML file and is stored in the variable text_df:
line  text
1 when uploading objective file bugzilla se
2 spelling mistake docs section searching fo…
3 editparams cgi won save updates iis instal…
4 editparams cgi won save updates
5 rfe unsubscribe from bug you reported
6 unsubscribe from bug you reported
I am using the following code to identify and remove the duplicates.
library(text2vec)
library(magrittr)  # for the %>% pipe

doc_set_1 = text_df
it1 = itoken(doc_set_1$text, progressbar = FALSE)
# the second set is the same corpus: we compare the documents with themselves
doc_set_2 = text_df
it2 = itoken(doc_set_2$text, progressbar = FALSE)
it = itoken(text_df$text, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
mat <- d1_d2_cos_sim
# zero the lower triangle and diagonal so each pair is counted only once
mat[lower.tri(mat, diag = TRUE)] <- 0
## convert the sparse matrix into a data frame
mdf <- as.data.frame(as.matrix(mat))
datalist = list()
for (i in 1:nrow(mat)) {
  t <- which(mat[i, ] > 0.8)
  if (length(t) >= 1) {  # was > 1, which silently skipped rows with a single duplicate
    datalist[[i]] <- t   # record the duplicate columns for row i
  }
}
# Number of duplicates found
length(unique(unlist(datalist)))
# Removing the similar documents
tmdf <- subset(mdf, select = -c(unique(unlist(datalist))))
text_df <- text_df[names(tmdf), ]
nrow(text_df)
This code takes a lot of time to run. Any suggestions to make it faster are welcome.
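The per-row loop over the densified matrix is a likely bottleneck. A vectorized sketch (not the original code) that finds all above-threshold pairs in a single `which(..., arr.ind = TRUE)` call; `mat` here is a small toy stand-in for the zeroed-lower-triangle cosine matrix:

```r
# Toy stand-in for the upper-triangular cosine similarity matrix of the
# six example documents (only the two near-duplicate pairs are non-zero).
mat <- matrix(0, nrow = 6, ncol = 6)
mat[3, 4] <- 0.845  # "editparams" pair
mat[5, 6] <- 0.913  # "unsubscribe" pair

# One pass over the whole matrix instead of a loop over rows:
pairs <- which(mat > 0.8, arr.ind = TRUE)
dups  <- unique(pairs[, "col"])  # column indices flagged as near-duplicates
dups
```

This produces the same set of indices as `unique(unlist(datalist))` without growing a list inside an R-level loop.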
Upvotes: 0
Views: 376
Reputation: 2829
The quanteda library works quite well for this case. Below is an example:
library(tibble)
library(quanteda)
df <- tibble(text = c("when uploading objective file bugzilla se",
                      "spelling mistake docs section searching fo",
                      "editparams cgi won save updates iis instal",
                      "editparams cgi won save updates",
                      "rfe unsubscribe from bug you reported",
                      "unsubscribe from bug you reported"))
DocTerm <- dfm(tokens(df$text))
textstat_simil(DocTerm, margin="documents", method = "cosine")
text1 text2 text3 text4 text5
text2 0.0000000
text3 0.0000000 0.0000000
text4 0.0000000 0.0000000 0.8451543
text5 0.0000000 0.0000000 0.0000000 0.0000000
text6 0.0000000 0.0000000 0.0000000 0.0000000 0.9128709
If one wants to see which pairs are more similar than a given threshold (here 0.9), one can do the following:
mycosinesim <- textstat_simil(DocTerm, margin = "documents", method = "cosine")
myMatcosine <- as.data.frame(as.matrix(mycosinesim))
higherthan90 <- as.data.frame(which(myMatcosine > 0.9, arr.ind = TRUE, useNames = TRUE))
higherthan90[which(higherthan90$row != higherthan90$col), ]
row col
text6 6 5
text5.1 5 6
Now you can decide whether to remove document 5 or 6, since they are nearly identical.
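To actually drop one document of each flagged pair, a minimal self-contained sketch (the pair table and data frame are recreated here as toy data; it keeps the lower-indexed document of each pair, which is one reasonable convention):

```r
# Toy stand-ins for the objects built above: the >0.9 pair table
# (documents 5 and 6 match each other) and the six-document data frame.
higherthan90 <- data.frame(row = c(6, 5), col = c(5, 6))
df <- data.frame(text = paste("document", 1:6))

# For each pair, treat the higher index as the duplicate and drop it.
to_drop <- unique(pmax(higherthan90$row, higherthan90$col))
df_dedup <- df[-to_drop, , drop = FALSE]
nrow(df_dedup)  # 5
```

Swapping `pmax` for `pmin` would instead keep the later document of each pair.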
Upvotes: 1