an4s

Reputation: 161

What is the best way to visually show similarities between lists?

I have data in the following form:

Every second, for N seconds, I write M strings to a list, where M can differ from one second to the next (M(i) is not necessarily equal to M(j) for i != j). I do this across 3 instances. That is, every second I create 3 lists, each holding an arbitrary number of strings, for a total of N seconds.

Now, I want to visually show how many strings the lists have in common at each second, possibly as a correlation or similarity matrix, and repeat this for all N seconds. I am not sure how to do this.

Suppose N = 3,

# instance 1
I1 = [['cat', 'dog', 'bob'], # 1st second
      ['eel', 'pug', 'emu'], # 2nd second
      ['owl', 'yak', 'elk']] # 3rd second
# instance 2
I2 = [['dog', 'fox', 'rat'], # 1st second
      ['emu', 'pug', 'ram'], # 2nd second
      ['bug', 'bee', 'bob']] # 3rd second
# instance 3
I3 = [['cat', 'bob', 'fox'], # 1st second
      ['emu', 'pug', 'eel'], # 2nd second
      ['bob', 'bee', 'yak']] # 3rd second

What is the best way to visualize the number of common elements at each second across the instances in Python? P.S., I can already plot this as a graph, but I am interested in creating a correlation or similarity matrix.

Upvotes: 0

Views: 462

Answers (1)

nepdavis

Reputation: 126

You can build the similarity matrix yourself with nested loops and plot it with matplotlib's imshow function. The approach below sums the similarity across all seconds; to keep the seconds separate you would need a 3-dimensional similarity matrix. That is doable with a small change to the code below, but you would need a way to visualize it other than a single imshow call.

import numpy as np
import matplotlib.pyplot as plt

# instance 1
I1 = [['cat', 'dog', 'bob'], # 1st second
      ['eel', 'pug', 'emu'], # 2nd second
      ['owl', 'yak', 'elk']] # 3rd second
# instance 2
I2 = [['dog', 'fox', 'rat'], # 1st second
      ['emu', 'pug', 'ram'], # 2nd second
      ['bug', 'bee', 'bob']] # 3rd second
# instance 3
I3 = [['cat', 'bob', 'fox'], # 1st second
      ['emu', 'pug', 'eel'], # 2nd second
      ['bob', 'bee', 'yak']] # 3rd second

total = [I1, I2, I3]

# initialize similarity matrix by number of instances you have
sim_matrix = np.zeros(shape=(len(total), len(total)))

# constant per your explanation
N = 3

# for each row in sim matrix
for i in range(len(total)):

    # for each column in sim matrix
    for j in range(len(total)):

        # if comparing itself
        if i == j:

            # similarity is total # of strings across all seconds (may not be constant)
            sim_matrix[i, j] = sum([len(t) for t in total[i]])

        else:

            # sum up each set intersection of each list of strings at each second
            sim_matrix[i, j] = sum(len(set(total[i][s]) & set(total[j][s])) for s in range(N))

sim_matrix should be

array([[9., 3., 6.],
       [3., 9., 5.],
       [6., 5., 9.]])

You can plot this using imshow

plt.imshow(sim_matrix)
plt.colorbar()
plt.show()
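The plot can be made easier to read by labelling the axes with the instance names and writing each count into its cell. The following is a minimal sketch, assuming the summed matrix shown above; the `labels` list and the color threshold are illustrative choices, not part of the original answer.

```python
import numpy as np
import matplotlib.pyplot as plt

# Summed similarity matrix from the example above
sim_matrix = np.array([[9., 3., 6.],
                       [3., 9., 5.],
                       [6., 5., 9.]])

# Hypothetical display names for the three instances
labels = ["I1", "I2", "I3"]

fig, ax = plt.subplots()
im = ax.imshow(sim_matrix, cmap="viridis")
fig.colorbar(im, ax=ax, label="shared strings")

# Name the rows and columns after the instances
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write each count inside its cell; dark text on bright cells, light text on dark cells
threshold = sim_matrix.max() / 2
for i in range(sim_matrix.shape[0]):
    for j in range(sim_matrix.shape[1]):
        color = "black" if sim_matrix[i, j] > threshold else "white"
        ax.text(j, i, f"{sim_matrix[i, j]:.0f}", ha="center", va="center", color=color)

plt.show()
```

Note that `ax.text` takes the column index as x and the row index as y, which is why the arguments are passed as `(j, i)`.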


There are almost certainly better and more efficient ways to do this, but if your number of lists is small, this is probably fine.
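One possible tidier variant, offered as a sketch rather than a definitive implementation: convert every per-second list to a set once, then compute each unordered pair of instances a single time and mirror the result, since the matrix is symmetric. The function name `summed_overlap` is made up for this illustration.

```python
from itertools import combinations

import numpy as np

I1 = [['cat', 'dog', 'bob'], ['eel', 'pug', 'emu'], ['owl', 'yak', 'elk']]
I2 = [['dog', 'fox', 'rat'], ['emu', 'pug', 'ram'], ['bug', 'bee', 'bob']]
I3 = [['cat', 'bob', 'fox'], ['emu', 'pug', 'eel'], ['bob', 'bee', 'yak']]


def summed_overlap(instances):
    """Return a symmetric matrix of shared-string counts summed over all seconds."""
    # Convert every per-second list to a set once, up front
    sets = [[set(second) for second in inst] for inst in instances]
    k = len(instances)
    sim = np.zeros((k, k))

    # Diagonal: total number of strings per instance, as in the loop version
    for i, inst in enumerate(instances):
        sim[i, i] = sum(len(second) for second in inst)

    # Off-diagonal: compute each unordered pair once and mirror it.
    # zip stops at the shorter instance if the numbers of seconds differ.
    for i, j in combinations(range(k), 2):
        shared = sum(len(a & b) for a, b in zip(sets[i], sets[j]))
        sim[i, j] = sim[j, i] = shared

    return sim


sim_matrix = summed_overlap([I1, I2, I3])
print(sim_matrix)
```

This roughly halves the number of set intersections and avoids rebuilding the same sets for every pair, which starts to matter only when there are many instances or many seconds.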

Edit

If you need a similarity matrix for each second, you could use the following modified code:

# first axis indexes the N seconds, the other two index the instances
sim_matrix = np.zeros(shape=(N, len(total), len(total)))

for i in range(len(total)):

    for j in range(len(total)):

        if i == j:

            sim_matrix[:, i, j] = [len(t) for t in total[i]]

        else:

            sim_matrix[:, i, j] = [len(set(total[i][s]) & set(total[j][s])) for s in range(N)]

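You could still pass the 3-D similarity matrix to imshow, but it treats an array of shape (rows, columns, 3) as an RGB image, so the last axis would be read as color channels instead of each second being drawn as its own matrix (and only 3 or 4 channels are accepted). Plotting each slice sim_matrix[s] separately avoids this.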
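Since imshow reads a 3-D array as a single image rather than a stack of matrices, one option is to draw each second's slice in its own subplot with a shared color scale. This is a sketch under the assumption that every instance has the same number of seconds; the variable name `per_second` and the figure size are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

I1 = [['cat', 'dog', 'bob'], ['eel', 'pug', 'emu'], ['owl', 'yak', 'elk']]
I2 = [['dog', 'fox', 'rat'], ['emu', 'pug', 'ram'], ['bug', 'bee', 'bob']]
I3 = [['cat', 'bob', 'fox'], ['emu', 'pug', 'eel'], ['bob', 'bee', 'yak']]

total = [I1, I2, I3]
N = len(I1)      # number of seconds (assumed equal for all instances)
k = len(total)   # number of instances

# per_second[s] is the k x k similarity matrix for second s
per_second = np.zeros((N, k, k))
for s in range(N):
    for i in range(k):
        for j in range(k):
            if i == j:
                # diagonal: number of strings in that list
                per_second[s, i, j] = len(total[i][s])
            else:
                # off-diagonal: number of strings the two lists share
                per_second[s, i, j] = len(set(total[i][s]) & set(total[j][s]))

# One panel per second; a shared vmin/vmax keeps colors comparable across panels
fig, axes = plt.subplots(1, N, figsize=(4 * N, 4), squeeze=False)
for s, ax in enumerate(axes[0]):
    im = ax.imshow(per_second[s], vmin=0, vmax=per_second.max())
    ax.set_title(f"second {s + 1}")
    ax.set_xticks(range(k))
    ax.set_yticks(range(k))
fig.colorbar(im, ax=axes.ravel().tolist(), label="shared strings")

plt.show()
```

Summing `per_second` over its first axis gives back the total similarity matrix from the first part of the answer. For large N, a grid of panels becomes unwieldy, and plotting each pair's count over time as a line may read better.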

Upvotes: 1
