Reputation: 331
I have a data frame holding the similarity of each pair of products, such as:
Product1 Product2 similarity
p1 p2 0.102
p1 p3 0.221
p1 p4 0.333
.....
p2 p1 0.102
p2 p3 0.201
p2 p4 0.242
I would like to pick the top 10 most similar products for each product. I tried:
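library(plyr)
# number the rows within each product1 group in row order (assumes rows are already sorted by descending similarity)
product.pairs <- ddply(product.pairs, "product1", transform, rank = seq_along(product1))
product.pairs <- subset(product.pairs, rank < 11, select = c(product1, product2))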
This works when the dataset is small, but once the number of products reaches 30k, it is too slow.
I also tried sqldf to mimic rank and partition, such as:
sql_top10 <- "select a.product1, a.product2, a.similarity, count(*) as rank from productpairs a join productpairs b on a.product1 = b.product1 and a.similarity >= b.similarity group by a.product1, a.similarity"
but this is even worse. Any suggestions?
Upvotes: 1
Views: 259
Reputation: 49448
Use data.table:
library(data.table)
dt = data.table(your_df)
# fast sort by similarity
setkey(dt, similarity)
# pick (at most) top 10 most similar ones
dt[, Product2[max(1, .N-9):.N], by = Product1]
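If you also want to keep the similarity value and get the rows in descending order, a variant of the same idea can be sketched like this. It assumes the column names from the sample table (Product1, Product2, similarity); the toy rows are the ones shown in the question, and `n` is an illustrative stand-in for 10.

```r
library(data.table)

# Toy data: the six rows shown in the question
dt <- data.table(
  Product1   = c("p1", "p1", "p1", "p2", "p2", "p2"),
  Product2   = c("p2", "p3", "p4", "p1", "p3", "p4"),
  similarity = c(0.102, 0.221, 0.333, 0.102, 0.201, 0.242)
)

n <- 2L  # illustrative; use 10L for the real data

# Sort by descending similarity, then keep the first n rows of each Product1 group
top <- dt[order(-similarity), head(.SD, n), by = Product1]
print(top)
```

Unlike `setkey`, `order()` inside `[` does not reorder `dt` itself, so the original table is left untouched.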
Upvotes: 3