Reputation: 161
I have data in the following form:
Every second, for N seconds, I write M strings to a list, where M can differ from one second to the next (M(i) is not necessarily equal to M(j) for i != j, with i, j in {1,..,N}). I do this across 3 instances. That is, every second, each of the 3 instances produces a list containing an arbitrary number of strings, for a total of N seconds.
Now, I want to visually show how many strings the lists have in common at each second, possibly as a correlation or similarity matrix, and repeat this for all N seconds. I am not sure how to do this.
Suppose N = 3:
# instance 1
I1 = [['cat', 'dog', 'bob'],  # 1st second
      ['eel', 'pug', 'emu'],  # 2nd second
      ['owl', 'yak', 'elk']]  # 3rd second
# instance 2
I2 = [['dog', 'fox', 'rat'],  # 1st second
      ['emu', 'pug', 'ram'],  # 2nd second
      ['bug', 'bee', 'bob']]  # 3rd second
# instance 3
I3 = [['cat', 'bob', 'fox'],  # 1st second
      ['emu', 'pug', 'eel'],  # 2nd second
      ['bob', 'bee', 'yak']]  # 3rd second
What is the best way to visualize the number of common elements at each second across the instances in Python? P.S., I can already plot this as a graph, but I am interested in creating a correlation or similarity matrix.
Upvotes: 0
Views: 462
Reputation: 126
You can iterate over the instances, build your own similarity matrix, and plot it with matplotlib's imshow function. The approach below gives the total similarity summed across all seconds; a per-second similarity would need a 3-dimensional similarity matrix. That is doable with a small modification to the code below, but you would need a way to visualize it other than a single imshow call.
import numpy as np
import matplotlib.pyplot as plt
# instance 1
I1 = [['cat', 'dog', 'bob'],  # 1st second
      ['eel', 'pug', 'emu'],  # 2nd second
      ['owl', 'yak', 'elk']]  # 3rd second
# instance 2
I2 = [['dog', 'fox', 'rat'],  # 1st second
      ['emu', 'pug', 'ram'],  # 2nd second
      ['bug', 'bee', 'bob']]  # 3rd second
# instance 3
I3 = [['cat', 'bob', 'fox'],  # 1st second
      ['emu', 'pug', 'eel'],  # 2nd second
      ['bob', 'bee', 'yak']]  # 3rd second
total = [I1, I2, I3]
# initialize similarity matrix by number of instances you have
sim_matrix = np.zeros(shape=(len(total), len(total)))
# constant per your explanation
N = 3
# for each row in sim matrix
for i in range(len(total)):
    # for each column in sim matrix
    for j in range(len(total)):
        # if comparing itself
        if i == j:
            # similarity is total # of strings across all seconds (may not be constant)
            sim_matrix[i, j] = sum(len(t) for t in total[i])
        else:
            # sum up each set intersection of each list of strings at each second
            sim_matrix[i, j] = sum(len(set(total[i][s]) & set(total[j][s])) for s in range(N))
sim_matrix
The result should be:
array([[9., 3., 6.],
[3., 9., 5.],
[6., 5., 9.]])
You can plot this using imshow
plt.imshow(sim_matrix)
plt.colorbar()
plt.show()
There are almost certainly better and more efficient ways to do this, but if your number of lists is small, this is probably fine.
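As one possible tidier variant (a sketch, not a benchmarked improvement), the pairwise counts can be computed once per pair and mirrored, since the matrix is symmetric. The helper name `total_overlap` is made up for this illustration, and it assumes the number of seconds can be taken from the shortest instance rather than a hard-coded N. It returns plain nested lists, which `np.array` or `plt.imshow` accept directly.

```python
from itertools import combinations


def total_overlap(instances):
    """Sum the per-second set intersections for every pair of instances."""
    k = len(instances)
    # assumption: only compare seconds that every instance actually has
    n_seconds = min(len(inst) for inst in instances)
    matrix = [[0] * k for _ in range(k)]
    for i in range(k):
        # diagonal: total number of strings the instance wrote
        matrix[i][i] = sum(len(sec) for sec in instances[i][:n_seconds])
    for i, j in combinations(range(k), 2):
        shared = sum(
            len(set(instances[i][s]) & set(instances[j][s]))
            for s in range(n_seconds)
        )
        # the overlap count is symmetric, so fill both halves at once
        matrix[i][j] = matrix[j][i] = shared
    return matrix


I1 = [['cat', 'dog', 'bob'], ['eel', 'pug', 'emu'], ['owl', 'yak', 'elk']]
I2 = [['dog', 'fox', 'rat'], ['emu', 'pug', 'ram'], ['bug', 'bee', 'bob']]
I3 = [['cat', 'bob', 'fox'], ['emu', 'pug', 'eel'], ['bob', 'bee', 'yak']]

print(total_overlap([I1, I2, I3]))
# [[9, 3, 6], [3, 9, 5], [6, 5, 9]]
```

This halves the number of set intersections compared with the double loop, which only matters if you have many instances.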
If you need a similarity matrix at each second, you could use the following modified code:
# first axis is the second, so its length is N (not the number of instances)
sim_matrix = np.zeros(shape=(N, len(total), len(total)))
for i in range(len(total)):
    for j in range(len(total)):
        if i == j:
            # number of strings instance i wrote at each second
            sim_matrix[:, i, j] = [len(t) for t in total[i]]
        else:
            # number of strings shared by instances i and j at each second
            sim_matrix[:, i, j] = [len(set(total[i][s]) & set(total[j][s])) for s in range(N)]
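You could still pass the 3-d similarity matrix to imshow, but only because its last axis happens to have length 3: imshow would then interpret that axis as RGB color channels rather than showing one matrix per second. Plotting each slice sim_matrix[s] on its own avoids this.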
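One way to look at the per-second matrices is to draw each second in its own subplot with a shared color scale. The following is a minimal sketch of that idea; the helper names `per_second_overlap` and `plot_per_second` are made up for this illustration, and the diagonal here counts unique strings (set size), which matches the list length only when a second contains no duplicates.

```python
def per_second_overlap(instances):
    """Return one K x K overlap-count matrix (nested lists) per second."""
    # assumption: only compare seconds that every instance actually has
    n_seconds = min(len(inst) for inst in instances)
    matrices = []
    for s in range(n_seconds):
        sets = [set(inst[s]) for inst in instances]
        # entry (i, j) is the number of strings shared by instances i and j
        matrices.append([[len(a & b) for b in sets] for a in sets])
    return matrices


def plot_per_second(matrices):
    """Draw one heatmap per second, side by side, on a common color scale."""
    # imported here so the counting code above has no plotting dependency
    import matplotlib.pyplot as plt

    vmax = max(max(max(row) for row in m) for m in matrices)
    fig, axes = plt.subplots(1, len(matrices), squeeze=False,
                             figsize=(4 * len(matrices), 4))
    for s, (ax, m) in enumerate(zip(axes[0], matrices)):
        im = ax.imshow(m, vmin=0, vmax=vmax)
        ax.set_title(f"second {s + 1}")
    # one colorbar shared by all panels
    fig.colorbar(im, ax=list(axes[0]))
    plt.show()


I1 = [['cat', 'dog', 'bob'], ['eel', 'pug', 'emu'], ['owl', 'yak', 'elk']]
I2 = [['dog', 'fox', 'rat'], ['emu', 'pug', 'ram'], ['bug', 'bee', 'bob']]
I3 = [['cat', 'bob', 'fox'], ['emu', 'pug', 'eel'], ['bob', 'bee', 'yak']]

per_second = per_second_overlap([I1, I2, I3])
print(per_second[0])
# [[3, 1, 2], [1, 3, 1], [2, 1, 3]]
```

Calling `plot_per_second(per_second)` then shows three 3 x 3 heatmaps, one per second. Fixing `vmin` and `vmax` keeps the colors comparable between panels.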
Upvotes: 1