Ashwini Khedkar

Reputation: 21

Getting an error while forming the train matrix in a book recommendation system

I am new to data science and am having trouble building a book recommendation system with collaborative filtering. Can someone please advise on the error below?

import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

data  = pd.read_csv('BX-Book-Ratings.csv',engine = 'python')
df = data.iloc[1:10000,:]
print(df)
print(df.dtypes)
df['isbn']= pd.to_numeric(df['isbn'], errors = 'coerce')
df = df[np.isfinite(df).all(1)]
df['isbn'] = df['isbn'].astype(np.int64)

from sklearn.model_selection import train_test_split
n_users = df.user_id.unique().shape[0] 
n_book = df.isbn.unique().shape[0]
train_data, test_data = train_test_split(df, test_size=0.5)
print(n_users , n_book)
train_data_matrix = np.zeros((n_users, n_book))
for line in train_data.itertuples():
    #[user_id index, book_id index] = given rating.
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3] 
train_data_matrix
--------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-125-caa0bcd40167> in <module>
      2 for line in train_data.itertuples():
      3     #[user_id index, book_id index] = given rating.
----> 4     train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
      5 train_data_matrix

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Upvotes: 2

Views: 103

Answers (1)

ved prakash

Reputation: 31

The most probable cause of the error is a mismatch between the values used as indices and the positions in the matrix. I can see isbn is cast to an int type, but what about user_id?
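As a rough illustration of how this error message arises, here is a minimal sketch on a toy array (the array and values are made up, not taken from the question's data). It assumes one of the index values is a float or some other non-integer type, which may or may not be what is happening with user_id here.

```python
import numpy as np

matrix = np.zeros((3, 3))

# A plain integer position is accepted.
matrix[1, 2] = 5.0

# A float position (hypothetical stand-in for a non-integer id)
# raises the same IndexError as in the question.
try:
    matrix[1.0, 2] = 5.0
    raised = False
    message = ""
except IndexError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

Printing df.dtypes after the cleaning steps should show whether user_id or isbn is still a non-integer type.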

Fix:

The fix is to create a unique, consecutive index for each user and each book, so that every position falls inside the n_users * n_book matrix.

  • Method 1: build a separate DataFrame of unique users and another of unique books, and use each DataFrame's index as the position.

  • Method 2: create a dict with the unique values as keys and a running index as values.

Whichever method is used, it must be applied consistently across the rest of the process, or the ratings will end up attached to the wrong user or book.

This fix uses Method 2.

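# Method 2
# Map each unique user_id to a consecutive row index (0 .. n_users - 1).
user_dict = {}
for index, value in enumerate(df.user_id.unique().tolist()):
    user_dict[value] = index

# Map each unique isbn to a consecutive column index (0 .. n_book - 1).
book_dict = {}
for index, value in enumerate(df.isbn.unique().tolist()):
    book_dict[value] = index

print(len(user_dict), len(book_dict))

train_data_matrix = np.zeros((len(user_dict), len(book_dict)))
for line in train_data.itertuples():
    # line[0] is the DataFrame index; line[1], line[2], line[3] are
    # assumed to be user_id, isbn and rating, as in the question.
    row_index = user_dict[line[1]]
    col_index = book_dict[line[2]]
    train_data_matrix[row_index, col_index] = line[3]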
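For comparison, Method 1 from the list above could be sketched roughly as follows. This is only an illustration on toy data: the ratings, the column name rating, and the variable names users, books, row_idx and col_idx are assumptions, not taken from the question's file.

```python
import numpy as np
import pandas as pd

# Toy ratings standing in for the real data (hypothetical values).
df = pd.DataFrame({
    "user_id": [276725, 276726, 276727, 276726],
    "isbn": [34545104, 155061224, 446520802, 34545104],
    "rating": [0, 5, 0, 8],
})

# One lookup frame per entity; each frame's default 0..n-1 index
# serves as the matrix position for that user or book.
users = pd.DataFrame({"user_id": df["user_id"].unique()})
books = pd.DataFrame({"isbn": df["isbn"].unique()})

# Translate raw ids into positions using the lookup frames.
row_idx = pd.Index(users["user_id"]).get_indexer(df["user_id"])
col_idx = pd.Index(books["isbn"]).get_indexer(df["isbn"])

# Fill the user x book matrix with the ratings.
matrix = np.zeros((len(users), len(books)))
matrix[row_idx, col_idx] = df["rating"].to_numpy()

print(matrix)
```

With the real data, the lookup frames would be built once from the full df, and the same lookups would then be used to fill both the train and test matrices so the positions stay consistent.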

Hope this helps. A snapshot of the data would probably make it easier to pin this down.

Upvotes: 1
