SanchoH

Reputation: 1

How can I optimize this logic and apply it to large datasets?

I’m working on a system where I calculate the similarity between user vectors and product vectors using cosine similarity in Python with NumPy. The code below performs the necessary operations, but I need help optimizing it for large datasets. The goal is to improve performance and memory usage.

I am computing cosine similarity between user embeddings and product catalog embeddings. This works fine for small datasets, but for larger volumes of data (with millions of users and products), the current approach becomes inefficient.

Specifically, the dot product calculation and memory usage seem to be bottlenecks, especially when working with dense matrices.
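
To put rough numbers on the memory side (the user and product counts below are placeholders for my real scale, not exact figures):

import numpy as np

n_users = 1_000_000      # placeholder: roughly the number of users
n_products = 1_000_000   # placeholder: roughly the catalog size

# A dense float32 user-by-product similarity matrix
bytes_needed = n_users * n_products * np.dtype(np.float32).itemsize
print(f"Full similarity matrix: {bytes_needed / 1e12:.1f} TB")  # ~4.0 TB

So materializing the full similarity matrix in one shot is clearly not an option at that scale.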

What I’ve tried so far:

Code:

import time
import logging

import numpy as np
from numpy.linalg import norm

logger = logging.getLogger(__name__)

start_time = time.time()

# dot product between product and user vector
dot_pdt_vectors = (product_catalog_emb_vector @ user_emb_vector.T).T

# calculate norm over product vector
product_vector_norm = norm(product_catalog_emb_vector, axis=1)

# calculate norm over user vector
user_vector_norm = norm(user_emb_vector, axis=1)

# Compute cosine similarity, then cosine distance
similarity = (
    dot_pdt_vectors /
    (product_vector_norm * user_vector_norm[:, np.newaxis])
)
cosine_distance_vector = 1 - similarity

# Identify the nearest product indexes and select the top 50 recommended products
product_recommended_idx = (
    np.argsort(cosine_distance_vector, axis=1)[:, :50]
)
product_recommended_vector = product_catalog_vector[product_recommended_idx]

# Dimensions of the recommended-product tensor: (num users, top-k, embedding dim)
bq, r, _ = product_recommended_vector.shape

# Repeat user metadata so it lines up with each user's recommended products
user_data_vector = (
    np.repeat(
        user_metadata_vector, r, axis=0
    ).reshape(
        bq, r, user_metadata_vector.shape[-1]
    )
)

# Add user metadata to product recommended vector
product_recommended_vector = np.concatenate(
    (user_data_vector, product_recommended_vector), axis=2
)
cols_pr_v = product_recommended_vector.shape[-1]

# Reshape Product Recommended Vector
product_recommended_vector = (
    product_recommended_vector.reshape(-1, cols_pr_v)
)
logger.info(
    f'Product Recommended Vector Successfully Created: {product_recommended_vector.shape}'
)
logger.info(
    f'Function execution time: {(time.time() - start_time):.4f} seconds'
)
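
For completeness, this is the kind of dummy setup I use to run the snippet above. The array names match the code, but the shapes, the separate product_catalog_vector, and the metadata width are placeholders rather than my real data:

rng = np.random.default_rng(0)

n_users, n_products = 1_000, 5_000   # placeholder sizes for local testing
emb_dim, meta_dim = 128, 8           # placeholder dimensions

user_emb_vector = rng.standard_normal((n_users, emb_dim)).astype(np.float32)
product_catalog_emb_vector = rng.standard_normal((n_products, emb_dim)).astype(np.float32)

# Full product catalog rows that get indexed by the recommendation indices
# (placeholder; in my real data this holds more than just the embedding)
product_catalog_vector = rng.standard_normal((n_products, emb_dim)).astype(np.float32)

# Per-user metadata that gets attached to every recommended product
user_metadata_vector = rng.standard_normal((n_users, meta_dim)).astype(np.float32)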

What I need help with:

In short: what is the best approach for performing these calculations on large datasets with dense matrices?

What alternative approaches can I explore?
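
One direction I've been experimenting with (not sure it's the right one) is pre-normalizing both embedding matrices, processing users in chunks so the full similarity matrix never exists in memory, and using np.argpartition instead of a full argsort for the top 50. The function below is just my rough sketch of that idea; the name, chunk size and k are mine, not anything final:

def top_k_products(user_emb, product_emb, k=50, chunk_size=10_000):
    # Normalize once so cosine similarity reduces to a plain dot product
    user_normed = user_emb / norm(user_emb, axis=1, keepdims=True)
    prod_normed = product_emb / norm(product_emb, axis=1, keepdims=True)

    top_idx = np.empty((user_emb.shape[0], k), dtype=np.int64)
    for start in range(0, user_emb.shape[0], chunk_size):
        chunk = user_normed[start:start + chunk_size]
        sim = chunk @ prod_normed.T                     # (chunk, n_products)
        # argpartition avoids sorting every row just to get the top k
        part = np.argpartition(-sim, k, axis=1)[:, :k]
        # sort only the k candidates per row
        order = np.argsort(-np.take_along_axis(sim, part, axis=1), axis=1)
        top_idx[start:start + chunk.shape[0]] = np.take_along_axis(part, order, axis=1)
    return top_idx

Is something along these lines the right way to go, or are there better options (approximate nearest neighbors, sparse representations, etc.)?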

Any suggestions or advice on optimizing this code for large-scale data would be greatly appreciated!

Thanks!

Upvotes: 0

Views: 52

Answers (0)
