Reputation: 234
This question should not be new, but I just cannot find it... forgive me for asking a repeated question.
Anyway, a content-based recommendation system requires us to create feature vectors for the items we are recommending. So there are two issues to solve to begin with: 1. Which components are important enough to be included in the feature vector that represents an item? 2. Once we have decided on all the components in the vector, who is responsible for populating the values?
Using movies as the most popular example, we would probably decide to use actors, director(s) and genre as the components of the vector. Now, for each movie from the past many years (there are lots of movies out there), how can we populate all these components to prepare the raw data for the vectors? Manually? Automatically (and if so, how)?
I could have missed something. It seems that whenever we decide to build a content-based system, we need to solve these issues, which are not easy to address. It almost seems like collaborative filtering is easier, since it only needs the utility matrix (user-item matrix) and does not require us to generate all the feature vectors. Of course, the utility matrix contains user ratings, which would be another headache to obtain.
Could someone share some thoughts on this? many thanks!
Upvotes: 1
Views: 1978
Reputation: 441
I had to build a content-based recommendation system that could take any e-commerce catalog as input and provide recommendations. Since the attributes of the catalog are not known beforehand, it had to be general-purpose. I took an approach similar to the one described in the above answers.
I used tf-idf with n-grams to vectorize the fields and the cosine distance metric to get the top-n recommendations.
A detailed writeup of the approach can be found here, and the code in this notebook
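As a rough illustration of the idea (not the code from the notebook mentioned above), here is a minimal sketch using only the Python standard library. It assumes character trigrams as the n-grams and a smoothed IDF formula; the catalog strings and all function names are made up for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build sparse TF-IDF vectors (dicts of n-gram -> weight) for a list of strings."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter(gram for count in counts for gram in count)
    num_docs = len(docs)
    vectors = []
    for count in counts:
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append({
            gram: tf * (math.log((1 + num_docs) / (1 + doc_freq[gram])) + 1)
            for gram, tf in count.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(index, vectors, n=2):
    """Return the indices of the n items most similar to the item at `index`."""
    scores = [
        (cosine_similarity(vectors[index], vec), j)
        for j, vec in enumerate(vectors)
        if j != index
    ]
    scores.sort(reverse=True)
    return [j for _, j in scores[:n]]


# Hypothetical catalog: each string stands for the concatenated fields of one product.
catalog = [
    "red cotton t-shirt",
    "blue cotton t-shirt",
    "stainless steel kettle",
    "red cotton polo shirt",
]
vectors = tfidf_vectors(catalog)
recommendations = top_n(0, vectors, n=2)
```

Because the vectors are built from whatever text the catalog fields contain, nothing in the sketch depends on knowing the attribute names in advance, which is what makes this kind of approach general-purpose.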
Upvotes: 0
Reputation: 1
When building a recommender system there is never a wrong or right approach to doing things. It's what works best for your particular scenario, and that may mean getting a higher accuracy score during the evaluation stage or generating the most revenue. When choosing features/attributes for your items in a content-based recommender, it's good to understand and get behind the data, but more importantly to use your intuition about what may give the most meaning and value to the item. How you choose your features will determine how well your recommender performs. Once you have chosen your features, you can transform their values into a vector space.
If the items are movies and you have features like name, actors, authors and a description, you can simply apply a TF-IDF approach, which converts the text values into numerical values and so produces a high-dimensional vector for each item. Once you have this vector space, you can use one of several distance measures (cosine, Euclidean, Manhattan) to compute how close items are to each other and rank them by smallest distance. You can then recommend items similar to a given item.
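The ranking step can be sketched as follows. The three vectors are hand-made stand-ins for TF-IDF output, and the item and function names are hypothetical; this is only meant to show how the distance measures plug into the same ranking logic.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def rank_by_distance(query, items, distance):
    """Return the other item names ordered from nearest to farthest from `query`."""
    others = [name for name in items if name != query]
    return sorted(others, key=lambda name: distance(items[query], items[name]))


# Hypothetical, hand-made vectors standing in for TF-IDF output.
items = {
    "movie_a": [0.9, 0.1, 0.0],
    "movie_b": [0.8, 0.2, 0.0],
    "movie_c": [0.0, 0.1, 0.9],
}

rankings = {
    fn.__name__: rank_by_distance("movie_a", items, fn)
    for fn in (cosine_distance, euclidean, manhattan)
}
```

On this toy data all three measures agree, but in general they can rank items differently, especially when the vectors are not length-normalized, so it is worth comparing them on your own data.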
This is just a high-level approach to creating a simple similarity measure; however, there are numerous ways to increase the complexity of the recommender system throughout the feature selection process.
Upvotes: 0
Reputation: 201
In content-based filtering, what you usually use is the ICM (item content matrix) or the UCM (user content matrix), depending on what you are computing the similarity on (items or users). The ICM (and/or the UCM) can be populated if the attributes of the items (or users) are given: if you have this information, you can build the matrix. Suppose you are given categorical attributes like genre, actors and director; you can then apply one-hot encoding to obtain your matrix. Once you have it, you can perform:
1) Feature selection (this was your first issue, "what components are important enough that should be included in the feature vector")
2) Some weighting scheme on the features, e.g. tf-idf (this, together with the first part of the answer, partially answers who should populate the values and how).
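A minimal sketch of building a one-hot ICM and applying the two steps above, assuming made-up movie metadata and a plain log(N / df) IDF weight. Dropping zero-weight columns is only one very simple form of feature selection, chosen here for illustration.

```python
import math

# Hypothetical item metadata: each movie is described by a set of categorical attributes.
movies = {
    "movie_a": {"genre:drama", "director:smith", "actor:lee", "language:en"},
    "movie_b": {"genre:drama", "director:jones", "actor:lee", "language:en"},
    "movie_c": {"genre:comedy", "director:jones", "actor:kim", "language:en"},
}

item_names = sorted(movies)
features = sorted(set().union(*movies.values()))

# One-hot ICM: one row per item, one column per feature, 1 if the item has the feature.
icm = [[1 if f in movies[name] else 0 for f in features] for name in item_names]

# IDF weight per feature column: features shared by many items carry less information.
num_items = len(item_names)
idf = [
    math.log(num_items / sum(row[j] for row in icm))
    for j in range(len(features))
]

# 1) Feature selection: a feature present in every item has weight 0 and can be dropped.
kept = [f for f, w in zip(features, idf) if w > 0]

# 2) Weighting: scale each one-hot column by its IDF weight.
weighted_icm = [
    [row[j] * idf[j] for j in range(len(features))]
    for row in icm
]
```

Here `language:en` appears in every movie, so it cannot help distinguish items and is filtered out, while a rare feature such as `actor:kim` receives the largest weight.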
Upvotes: 1