Reputation: 311
My sample dataframe:
query <- c("women dress","dress women","dresses women","black women jean","women jeans black")
SearchVolume <- c(1000,1000,400,900,900)
PredictiveImpression <- c(900,900,200,700,700)
Lem <- c("women,dress","dress,women","dress women","black,women,jean","women,jean,black")
data <- data.frame(query,SearchVolume,PredictiveImpression,Lem)
I need to remove queries that have (1) the same words, regardless of word order and singular/plural form, and (2) the same Search Volume and Predictive Impression. In the end, "women dress", "dresses women" and "black women jean" should stay.
I have used lemmatization in R to extract the root words, but couldn't figure out how to deduplicate queries that contain the same words in a different order. Here is what I have accomplished so far.
Upvotes: 1
Views: 87
Reputation: 887038
We can split 'Lem' into a list of vectors, sort each one, apply duplicated, and subset
data[!duplicated(lapply(strsplit(as.character(data$Lem), ','), sort)),]
#             query SearchVolume PredictiveImpression              Lem
#1      women dress         1000                  900      women,dress
#3    dresses women          400                  200      dress women
#4 black women jean          900                  700 black,women,jean
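Note that the one-liner compares only the sorted lemmas. It keeps "dresses women" here because that row's Lem is stored without a comma, so it never matches "women,dress". If the numeric columns should also take part in the comparison, as condition (2) in the question asks, a minimal sketch could look like the following. The names `lem_key`, `keys` and `deduped` are introduced here for illustration only, and splitting on whitespace as well as commas is an assumption about how the lemmas are stored.

```r
# Sample data from the question (row 3's Lem is assumed to be "dress women").
data <- data.frame(
  query = c("women dress", "dress women", "dresses women",
            "black women jean", "women jeans black"),
  SearchVolume = c(1000, 1000, 400, 900, 900),
  PredictiveImpression = c(900, 900, 200, 700, 700),
  Lem = c("women,dress", "dress,women", "dress women",
          "black,women,jean", "women,jean,black"),
  stringsAsFactors = FALSE
)

# Order-independent key: split on commas or whitespace, sort, and rejoin.
lem_key <- vapply(
  strsplit(as.character(data$Lem), "[,[:space:]]+"),
  function(words) paste(sort(words), collapse = ","),
  character(1)
)

# A row counts as a duplicate only if the key, the search volume and the
# predictive impression all match an earlier row.
keys <- data.frame(lem_key, data$SearchVolume, data$PredictiveImpression)
deduped <- data[!duplicated(keys), ]
deduped$query
```

With this key, "dresses women" shares its lemmas with "women dress" but survives because its SearchVolume and PredictiveImpression differ.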
Upvotes: 2