Reputation: 492
I'm trying to build a simple user-based collaborative filtering recommender in Django for an e-commerce site, using just the purchase history.
Here are the steps I use. I know it needs improvement, but I have no idea what the next move is.
Here's the product model:
from django.db import models

class Product(models.Model):
    name = models.CharField(max_length=100)
    description = models.TextField()
Here's the purchase model:
# Assumption: User is Django's built-in auth user model.
from django.contrib.auth.models import User

class Purchase(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    purchase_date = models.DateTimeField(auto_now_add=True)
Now, to get similar users:
def find_similar_users(user, k=5):
    # Score every other user against the target user.
    all_users = User.objects.exclude(id=user.id)
    similarities = [(other_user, jaccard_similarity(user, other_user)) for other_user in all_users]
    # Highest similarity first, then keep the top k users.
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [user_similarity[0] for user_similarity in similarities[:k]]
And to calculate the similarity between two users:
def jaccard_similarity(user1, user2):
    # Each call runs two queries, one per user.
    user1_purchases = set(Purchase.objects.filter(user=user1).values_list('product_id', flat=True))
    user2_purchases = set(Purchase.objects.filter(user=user2).values_list('product_id', flat=True))
    intersection = user1_purchases.intersection(user2_purchases)
    union = user1_purchases.union(user2_purchases)
    # Guard against division by zero when neither user has purchases.
    return len(intersection) / len(union) if len(union) > 0 else 0
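For reference, here is a minimal worked example of the same Jaccard calculation on plain Python sets, with no database involved. The product IDs and the names alice, bob and carol are made up for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical product IDs, for illustration only.
alice = {1, 2, 3}
bob = {2, 3, 4}
carol = {7}

print(jaccard(alice, bob))    # 2 shared out of 4 distinct products -> 0.5
print(jaccard(alice, carol))  # nothing shared -> 0.0
```

The score is 1.0 only when two users bought exactly the same set of products, and 0.0 when they share nothing.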
Now here's my entry function:
def recommend_products(user, k=5):
    similar_users = find_similar_users(user, k)
    recommended_products = set()
    for similar_user in similar_users:
        # Skip products that were already collected from an earlier similar user.
        purchases = Purchase.objects.filter(user=similar_user).exclude(product__in=recommended_products)
        for purchase in purchases:
            recommended_products.add(purchase.product)
    return recommended_products
Obviously that will be really slow, so I was thinking of keeping a copy of the data in a separate NoSQL database.
When user A purchases something, I would copy the data to the other database, run the calculation in a background service such as Celery, and store the resulting recommendations in the NoSQL database. Later I would just retrieve them for user A if needed. Is that the right approach?
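The precompute-and-store idea described above could be sketched roughly as follows. This is only an illustration: a plain dict stands in for the NoSQL store, and the names recommendation_cache, refresh_recommendations and get_recommendations are hypothetical. In a real setup the refresh step would run inside the background job (for example a Celery task) after a purchase is saved.

```python
from typing import Callable, Dict, List

# Stand-in for the NoSQL store: user_id -> list of recommended product IDs.
recommendation_cache: Dict[int, List[int]] = {}


def refresh_recommendations(user_id: int, compute: Callable[[int], List[int]]) -> None:
    """Write path: recompute and store recommendations (meant to run in the background)."""
    recommendation_cache[user_id] = compute(user_id)


def get_recommendations(user_id: int) -> List[int]:
    """Read path: a plain lookup, returning an empty list if nothing was computed yet."""
    return recommendation_cache.get(user_id, [])


# Usage with a dummy compute function that returns fixed product IDs.
refresh_recommendations(1, lambda uid: [10, 11])
print(get_recommendations(1))  # [10, 11]
print(get_recommendations(2))  # []
```

One thing to keep in mind with this design: a purchase by user A also changes A's similarity to everyone else, so other users' stored results can go stale until they are refreshed as well.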
Upvotes: 2
Views: 161
Reputation: 477607
You can boost efficiency a lot with:
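def find_similar_users(user, k=5):
    # prefetch_related loads the purchases of all other users in one extra query.
    all_users = User.objects.exclude(id=user.id).prefetch_related('purchase_set')
    similarities = [
        (other_user, jaccard_similarity(user, other_user))
        for other_user in all_users
    ]
    # The rest is unchanged from the question: sort and keep the top k.
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [user_similarity[0] for user_similarity in similarities[:k]]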
def jaccard_similarity(user1, user2):
    # Read from the related manager so prefetched purchases are reused.
    user1_purchases = {
        purchase.product_id for purchase in user1.purchase_set.all()
    }
    user2_purchases = {
        purchase.product_id for purchase in user2.purchase_set.all()
    }
    intersection = user1_purchases.intersection(user2_purchases)
    union = user1_purchases.union(user2_purchases)
    return len(intersection) / len(union) if len(union) > 0 else 0
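This will retrieve the other users' Purchases in bulk, with one query for the users and one for all of their purchases instead of a query per user, which is probably where the bottleneck is anyway. Note that the target user's own purchases are still fetched on every comparison unless that user was also loaded with prefetch_related('purchase_set').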
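Going one step further, the whole comparison can be done in memory from a single list of (user_id, product_id) pairs. The sketch below is not part of the code above; build_purchase_sets and top_k_similar are hypothetical helper names, and the sample pairs are made up. In Django such pairs could come from Purchase.objects.values_list('user_id', 'product_id'), which is a single query.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_purchase_sets(pairs: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Group (user_id, product_id) pairs into one set of product IDs per user."""
    by_user: Dict[int, Set[int]] = defaultdict(set)
    for user_id, product_id in pairs:
        by_user[user_id].add(product_id)
    return dict(by_user)


def top_k_similar(user_id: int, by_user: Dict[int, Set[int]], k: int = 5) -> List[int]:
    """Return the IDs of the k users with the highest Jaccard similarity to user_id."""
    mine = by_user.get(user_id, set())
    scored = []
    for other_id, theirs in by_user.items():
        if other_id == user_id:
            continue
        union = mine | theirs
        score = len(mine & theirs) / len(union) if union else 0.0
        scored.append((score, other_id))
    # Highest score first; ties are broken by the lower user ID.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other_id for _, other_id in scored[:k]]


# Hypothetical sample data: (user_id, product_id).
pairs = [(1, 10), (1, 11), (2, 10), (2, 11), (3, 11), (4, 99)]
by_user = build_purchase_sets(pairs)
print(top_k_similar(1, by_user, k=2))  # [2, 3]
```

Users without any purchases never appear in the pairs, so they are simply never ranked, which matches the similarity of 0 they would get otherwise.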
Upvotes: 1