Kristada673

Reputation: 3744

How to calculate user-similarity matrix in a more efficient manner?

I have a set of 10 users, each with their own folder containing 25-30 images they shared (on some social media site, say). I want to calculate the similarities between the users based on the images they shared.

For that, I load each image as a 224x224x3 array and pass it through a feature extractor, then loop through each user and each of the images in their folders to compute the cosine similarity between each pair of images, and finally average those pairwise image similarities for each pair of users to get the user similarity. (Please let me know if there's a mistake in this logic, by the way.)

My code to do all this is as follows:

from tensorflow.keras.applications.imagenet_utils import preprocess_input
from tensorflow.keras.applications import vgg16
from tensorflow.keras.preprocessing.image import load_img,img_to_array
from tensorflow.keras.models import Model

import os
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# load the model
vgg_model = vgg16.VGG16(weights='imagenet')

# remove the last layers in order to get features instead of predictions
feat_extractor = Model(inputs=vgg_model.input, outputs=vgg_model.get_layer("fc2").output)

def processed_image(image):
    original = load_img(image, target_size=(224, 224))
    numpy_image = img_to_array(original)
    image_batch = np.expand_dims(numpy_image, axis=0)
    processed_image = preprocess_input(image_batch.copy())
    img_features = feat_extractor.predict(processed_image)
    return img_features

def image_similarity(image1, image2):
    image1 = processed_image(image1)
    image2 = processed_image(image2)
    sim = cosine_similarity(image1, image2)
    return sim[0][0]

user_list = ['User '+str(i) for i in range(1,11)]
user_sim_df = pd.DataFrame(columns=user_list, index=user_list)
for user1 in user_list:
    for user2 in user_list:
        sum_img_sim = 0
        user1_files = [imgs_path + x for x in os.listdir('All_Users/'+user1) if "jpg" in x]
        user2_files = [imgs_path + x for x in os.listdir('All_Users/'+user2) if "jpg" in x]
        
        for image1 in user1_files:
            for image2 in user2_files:
                sum_img_sim += image_similarity(image1, image2)
        
        user_sim_df[user1][user2] = 2*sum_img_sim/(len(user1_files)+len(user2_files))

Now, because there are 4 nested for loops involved in calculating the user similarity matrix, the code takes a long time to run (as of typing this question, it has been running for more than 30 minutes for 10 users with 25-30 images each).

So, how do I rewrite the last portion of this to make the code run faster?

Upvotes: 0

Views: 360

Answers (1)

dhasson

Reputation: 248

Nested for loops are particularly slow in Python, but there is room for improvement here.

First of all, you are doing the work twice in the comparisons: user_sim_df[user_i][user_j] has the same value as user_sim_df[user_j][user_i] for all pairs i, j. You could reuse the values already calculated instead of computing them again in later iterations. Besides this, is computing the values on the diagonal (user_sim_df[user_i][user_i]) necessary for your application?

These simple changes will roughly halve the execution time. Is that enough? Maybe not. Further lines of improvement:
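A minimal sketch of the symmetry idea, assuming a hypothetical function pair_similarity(user_a, user_b) that stands in for the two inner loops of the question. The function and variable names are illustrative, not part of the original code.

```python
from itertools import combinations_with_replacement


def symmetric_similarity(users, pair_similarity):
    """Compute each unordered user pair once and mirror the result.

    `pair_similarity` is a hypothetical callable taking two user names
    and returning a number; it replaces the question's inner loops.
    """
    sim = {u: {} for u in users}
    # combinations_with_replacement yields (a, b) with a at or before b
    # in `users`, so the diagonal is included and no pair is repeated.
    for a, b in combinations_with_replacement(users, 2):
        value = pair_similarity(a, b)
        sim[a][b] = value
        sim[b][a] = value  # reuse the value instead of recomputing it
    return sim
```

If the diagonal is not needed, itertools.combinations(users, 2) skips the self-comparisons as well.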

  1. processed_image(), including img_to_array() and feat_extractor.predict(), is run on every image each time you calculate its similarity with another one. Is it a bottleneck? In that case, performance could also improve if you first run a loop over all images and save the result to a file that numpy can read later, for example with numpy.load() - or maybe just save the preprocessed outputs of the TensorFlow pipeline currently being used.
  2. if you're using the standard Python interpreter (CPython), changing to PyPy can help (in general). You could also try adapting the code to consist only of operations on numpy structures (e.g. adapt the pandas parts) and use Numba in a way similar to this SO link. Using Numba you can also benefit from parallelism. See some practical guidelines here.
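The first suggestion can be taken a step further: if each image's feature vector is computed once and stored, the pairwise comparisons reduce to matrix products. The following is only a sketch under stated assumptions: features_by_user is a hypothetical dict mapping each user name to an array of shape (n_images, n_features), and every feature vector is assumed to be non-zero. It averages over all image pairs, as the question describes in prose; the posted code instead divides by (len(user1_files)+len(user2_files))/2, so adjust the normalization if that is what you intend.

```python
import numpy as np


def mean_cosine_similarity(feats_a, feats_b):
    """Average cosine similarity over all image pairs of two users.

    feats_a has shape (n_a, d) and feats_b has shape (n_b, d).
    Assumes no row is all zeros (otherwise the norm is 0).
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    # (n_a, n_b) matrix of cosine similarities, then its mean.
    return float((a @ b.T).mean())


def user_similarity_matrix(features_by_user):
    """Return (user names, symmetric matrix of mean similarities)."""
    users = list(features_by_user)
    n = len(users)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):  # upper triangle only, then mirror
            value = mean_cosine_similarity(
                features_by_user[users[i]], features_by_user[users[j]]
            )
            sim[i, j] = sim[j, i] = value
    return users, sim
```

With the question's code, the per-user arrays could be built once with something like np.vstack([processed_image(p) for p in user_files]), so that feat_extractor.predict() runs once per image rather than once per comparison.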

Upvotes: 1
