Reputation: 1
I’m working on a recommendation system that computes cosine similarity between user embedding vectors and product catalog embedding vectors in Python with NumPy. The code below does what I need, but for larger volumes of data (millions of users and products) it becomes inefficient, and I’d like help improving both its performance and its memory usage.
Specifically, the dot product calculation and memory usage appear to be the bottlenecks, especially when working with dense matrices.
I’ve tried using np.argsort to find the nearest products, but this becomes slow for large matrices.
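For reference, I have also looked at np.argpartition as a way to avoid the full sort when I only need the 50 nearest products; this is a rough, untested sketch of what I mean (the shapes and data here are made up, not my real arrays):
import numpy as np

# Made-up example: 1,000 users x 5,000 products of cosine distances
rng = np.random.default_rng(0)
cosine_distances = rng.random((1000, 5000))
k = 50

# argpartition places the k smallest distances of each row in the first k
# positions without fully sorting the row
top_k_unsorted = np.argpartition(cosine_distances, k, axis=1)[:, :k]
# Sort only those k candidates per row if the ranking order matters
rows = np.arange(cosine_distances.shape[0])[:, None]
order = np.argsort(cosine_distances[rows, top_k_unsorted], axis=1)
top_k_sorted = top_k_unsorted[rows, order]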
I’ve also attempted reshaping and broadcasting user metadata, but the code’s memory usage grows significantly with larger datasets.
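I also wondered whether np.broadcast_to could stand in for np.repeat so the repeated user metadata stays a view instead of a copy, though I’m not sure it helps once the array has to be materialized for the concatenation. A small sketch with made-up shapes:
import numpy as np

# Made-up shapes: 1,000 users, 50 recommended products each, 16 metadata columns
user_metadata = np.random.rand(1000, 16)
bq, r = 1000, 50

# broadcast_to returns a read-only view with the repeated shape, so no copy is made here
user_data_view = np.broadcast_to(
    user_metadata[:, np.newaxis, :], (bq, r, user_metadata.shape[-1])
)
# The copy is only paid later, e.g. when np.concatenate materializes the view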
Code:
import time
import logging

import numpy as np
from numpy.linalg import norm

logger = logging.getLogger(__name__)

start_time = time.time()
# Dot products between every user vector and every product vector -> shape (n_users, n_products)
dot_pdt_vectors = (product_catalog_emb_vector @ user_emb_vector.T).T
# L2 norm of each product vector
product_vector_norm = norm(product_catalog_emb_vector, axis=1)
# L2 norm of each user vector
user_vector_norm = norm(user_emb_vector, axis=1)
# Cosine similarity: dot products divided by the product of the norms
similarity = (
dot_pdt_vectors /
(product_vector_norm * user_vector_norm[:, np.newaxis])
)
cosine_distance_vector = 1 - similarity
# Indices of the 50 nearest products per user (the recommended products)
product_recommended_idx = (
np.argsort(cosine_distance_vector, axis=1)[:, :50]
)
product_recommended_vector = product_catalog_vector[product_recommended_idx]
# Define dimensions from previous vector
bq, r, _ = product_recommended_vector.shape
# Repeat user metadata so it aligns with the 50 recommended products per user
user_data_vector = (
np.repeat(
user_metadata_vector, r, axis=0
).reshape(
bq, r, user_metadata_vector.shape[-1]
)
)
# Concatenate user metadata onto each recommended product vector
product_recommended_vector = np.concatenate(
(user_data_vector, product_recommended_vector), axis=2
)
cols_pr_v = product_recommended_vector.shape[-1]
# Reshape Product Recommended Vector
product_recommended_vector = (
product_recommended_vector.reshape(-1, cols_pr_v)
)
logger.info(
f'Product Recommended Vector Successfully Created: {product_recommended_vector.shape}'
)
logger.info(
f'Function execution time: {(time.time() - start_time):.4f} seconds'
)
Optimization: How can I optimize the cosine similarity calculation and reduce memory usage for large datasets? Should I consider sparse matrices, batch processing, or alternative algorithms?
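To make the batch-processing part of the question concrete, this is roughly the direction I have in mind (the function name, chunk size, and pre-normalization are just my sketch, not working production code), but I don’t know whether it is the right approach:
import numpy as np
from numpy.linalg import norm

def top_k_similar_batched(user_emb, product_emb, k=50, batch_size=10_000):
    # Normalize once so cosine similarity reduces to a plain dot product
    product_unit = product_emb / norm(product_emb, axis=1, keepdims=True)
    user_unit = user_emb / norm(user_emb, axis=1, keepdims=True)
    results = []
    for start in range(0, user_unit.shape[0], batch_size):
        chunk = user_unit[start:start + batch_size]
        # Only one (batch_size, n_products) similarity block is in memory at a time
        sims = chunk @ product_unit.T
        # Largest similarities correspond to smallest cosine distances
        idx = np.argpartition(-sims, k, axis=1)[:, :k]
        results.append(idx)
    return np.vstack(results)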
Efficient memory usage: Are there more memory-efficient ways to handle these embeddings (e.g., using np.memmap, or distributed frameworks like Dask)?
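For the np.memmap idea, this is the kind of thing I am picturing (the file name, dtype, and shape are made up): the embeddings would be written to disk once and then read lazily, block by block.
import numpy as np

# Made-up file name and dimensions
n_products, dim = 5_000_000, 128
product_emb = np.memmap('product_embeddings.f32', dtype=np.float32,
                        mode='r', shape=(n_products, dim))

# Slicing reads only the requested rows from disk, so RAM holds one block at a time
block = np.asarray(product_emb[:100_000])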
Scaling: How can I scale this to handle millions of users and products efficiently?
In short, what is the best approach for performing these calculations on large datasets with dense matrices?
What alternative approaches can I explore?
Any suggestions or advice on optimizing this code for large-scale data would be greatly appreciated!
Thanks!
Upvotes: 0
Views: 52