Pandas dataframe operations too slow on dataset with a lot of columns

Question

I have a similarity matrix (pandas Dataframe) and I want to go through each product and get the most 5 similar products, then put them in a final Dataframe called itemAffinity but as the similarity matrix has 31878 items(products)=> means 31878 columns and 31878 rows . Executing the below function, could not be finished (takes a lot of time).

def get_items_similarity_score(similarity_matrix):
    products_list = similarity_matrix.columns.values.tolist()
    #Create an empty data frame to store item affinity scores for items.
    itemAffinity= pd.DataFrame(columns=('item1', 'item2', 'score'))
    rowCount=0
    for item in products_list:
        #get top 5 similar products which are not item
        if isinstance(item,int):
            series_sim = similarity_matrix.loc[item].nlargest(6)
            #print series_sim
            df = pd.DataFrame({'product':series_sim.index, 'score':series_sim.values})
            df = df[df['product'] != item]
            for r in range(len(df)):
                itemAffinity.loc[rowCount] = [item,df.iloc[r]['product'],df.iloc[r]['score']]
                rowCount +=1
                itemAffinity.sort_values("score", ascending=False, inplace=True)
    return itemAffinity

the function that I used to generate the similarity matrix:

def calculate_similarity(data_items):
"""Calculate the column-wise cosine similarity for a sparse
matrix. Return a new dataframe matrix with similarities.
"""
data_sparse = sparse.csr_matrix(data_items)
#pairwise similarities between all samples in data_sparse.transpose()
similarities = cosine_similarity(data_sparse.transpose())
sim = pd.DataFrame(data=similarities, index= data_items.columns, columns= data_items.columns)
return sim

is there a way to have the expected result but more effective performance?

Pandas dataframe operations too slow on dataset with a lot of columns

Answers (1)

Related Questions