Ran Tao

Reputation: 311

Remove duplicate text with same characters in R

My sample dataframe:

query <- c("women dress","dress women","dresses women","black women jean","women jeans black")
SearchVolume <- c(1000,1000,400,900,900)
PredictiveImpression <- c(900,900,200,700,700)
Lem <- c("women,dress","dress,women","dress women","black,women,jean","women,jean,black")

data <- data.frame(query,SearchVolume,PredictiveImpression,Lem)

I need to remove queries that have (1) the same words, even if they appear in a different order or differ in singular/plural form, and (2) the same Search Volume and Predictive Impression. In the end, "women dress", "dresses women" and "black women jean" should stay.

I have used lemmatization in R to extract the root words, but I couldn't figure out how to deduplicate queries that contain the same words in a different order. Here is what I have so far:

[image: current result]

My expected result: [image]

Upvotes: 1

Views: 87

Answers (1)

akrun

Reputation: 887038

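We can split 'Lem' on commas into a list of character vectors, sort each vector so that word order no longer matters, flag repeats with duplicated, and subset the rows that are not duplicates: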

data[!duplicated(lapply(strsplit(as.character(data$Lem), ','), sort)),]
#           query SearchVolume PredictiveImpression              Lem
#1      women dress         1000                  900      women,dress
#3    dresses women          400                  200      dress women
#4 black women jean          900                  700 black,women,jean
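
The one-liner above keys on 'Lem' only, and row 3 survives because its 'Lem' value ("dress women") contains no comma and so is never split. A minimal sketch of a variant that also applies condition (2) from the question might look like the following. It assumes that a row should count as a duplicate only when the sorted words, SearchVolume and PredictiveImpression all match an earlier row, and that splitting on spaces as well as commas is acceptable. The names lem_key, is_dup and result are illustrative, not from the original post.

```r
# Rebuild the sample data from the question
# (stringsAsFactors = FALSE keeps the text columns as character).
data <- data.frame(
  query = c("women dress", "dress women", "dresses women",
            "black women jean", "women jeans black"),
  SearchVolume = c(1000, 1000, 400, 900, 900),
  PredictiveImpression = c(900, 900, 200, 700, 700),
  Lem = c("women,dress", "dress,women", "dress women",
          "black,women,jean", "women,jean,black"),
  stringsAsFactors = FALSE
)

# Order-independent key: split on commas or spaces (an assumption),
# sort the words, and join them back into one string per row.
lem_key <- vapply(
  strsplit(data$Lem, "[, ]+"),
  function(words) paste(sort(words), collapse = ","),
  character(1)
)

# A row is a duplicate only if the key AND both numeric columns
# match an earlier row (assumed reading of condition (2)).
is_dup <- duplicated(data.frame(
  lem_key,
  SearchVolume = data$SearchVolume,
  PredictiveImpression = data$PredictiveImpression
))

result <- data[!is_dup, ]
print(result)
```

With this key, "dresses women" is kept because its SearchVolume and PredictiveImpression differ from those of "women dress", not because of how its 'Lem' string happens to be delimited.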

Upvotes: 2
