How to calculate user-similarity matrix in a more efficient manner?

Question

I have a set of 10 users, each with their own folder/directories, containing 25-30 images shared by them (in some social media, say). I want to calculate the similarities between the users based on the images shared by them.

For that, I use a feature extractor to convert each image into a 224x224x3 array, then loop through each user and each of the images in their folders to find the cosine similarity between each pair images, then take the average of all those pairwise image similarities for each pair of users to find the user similarity. (Please let me know if there's some mistake in this logic by the way).

My code to do all this is as follows:

from tensorflow.keras.applications.imagenet_utils import preprocess_input
from tensorflow.keras.applications import vgg16
from tensorflow.keras.preprocessing.image import load_img,img_to_array
from tensorflow.keras.models import Model

import os
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# load the model
vgg_model = vgg16.VGG16(weights='imagenet')

# remove the last layers in order to get features instead of predictions
feat_extractor = Model(inputs=vgg_model.input, outputs=vgg_model.get_layer("fc2").output)

def processed_image(image):
    original = load_img(image, target_size=(224, 224))
    numpy_image = img_to_array(original)
    image_batch = np.expand_dims(numpy_image, axis=0)
    processed_image = preprocess_input(image_batch.copy())
    img_features = feat_extractor.predict(processed_image)
    return img_features

def image_similarity(image1, image2):
    image1 = processed_image(image1)
    image2 = processed_image(image2)
    sim = cosine_similarity(image1, image2)
    return sim[0][0]

user_list = ['User '+str(i) for i in range(1,11)]
user_sim_df = pd.DataFrame(columns=user_list, index=user_list)
for user1 in user_list:
    for user2 in user_list:
        sum_img_sim = 0
        user1_files = [imgs_path + x for x in os.listdir('All_Users/'+user1) if "jpg" in x]
        user2_files = [imgs_path + x for x in os.listdir('All_Users/'+user2) if "jpg" in x]
        
        for image1 in user1_files:
            for image2 in user2_files:
                sum_img_sim += image_similarity(image1, image2)
        
        user_sim_df[user1][user2] = 2*sum_img_sim/(len(user1_files)+len(user2_files))

Now, because there are 4 for loops involved in calculating the user similarity matrix, the code take a long time too run (its been more than 30 minutes as of typing this question, that the code has been running for 10 users with 25-30 images each).

So, how do I rewrite the last portion of this to make the code run faster?

How to calculate user-similarity matrix in a more efficient manner?

Answers (1)

Related Questions