Reputation: 73
I want to compute correlation percentages between multiple items that appear in log files. To do so, I divide the number of times an item appears by the number of times it appears while another item is present. I won't go into too much detail, but this correlation is not symmetrical: the correlation between A and B is not the same as the one between B and A.
As output I have a dictionary in a format like this:
{
    itemA: {
        itemB: 0.85,
        itemC: 0.12
    },
    itemB: {
        itemC: 0.68,
        itemA: 0.24
    },
    itemC: {
        itemA: 0.28
    }
}
I have tried working with DictVectorizer from sklearn, but it doesn't work since it requires a list of dictionaries.
I would like the output to be a matrix for visualisation with matplotlib, something like this:
[[1, 0.85, 0.12],
 [0.68, 1, 0.24],
 [0.28, 0, 1]]
If possible, I would also like the matplotlib visualisation to have a label for each row and column, since my dict has far more than 3 items.
I hope that everything is clear. Thank you for your help.
Upvotes: 0
Views: 621
Reputation: 3355
You can do this efficiently with pandas and numpy:
import numpy as np
import pandas as pd

d = {
    'itemA': {
        'itemB': 0.85,
        'itemC': 0.12
    },
    'itemB': {
        'itemA': 0.68,
        'itemC': 0.24
    },
    'itemC': {
        'itemA': 0.28
    }
}
df = pd.DataFrame(d)
# since this is a matrix of co-occurrences of a set of objects,
# sort columns and rows alphabetically
df = df.sort_index(axis=0)
df = df.sort_index(axis=1)
# the outer keys become columns, so transpose to get one row per outer key;
# copy=True gives a writable array that is independent of the dataframe
a = df.T.to_numpy(copy=True)
# if needed, fill the diagonal with 1 and replace NaN with 0
np.fill_diagonal(a, 1)
a[np.isnan(a)] = 0
The matrix is now:
array([[1.  , 0.85, 0.12],
       [0.68, 1.  , 0.24],
       [0.28, 0.  , 1.  ]])
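With many more items, some of them may only ever appear as an inner key, in which case the dataframe is no longer square. A minimal sketch of one way to handle that, assuming missing pairs should count as 0 and the diagonal as 1 (the helper name `dict_to_square_matrix` is made up for this example):

```python
import numpy as np
import pandas as pd


def dict_to_square_matrix(d):
    """Return (items, matrix) with one row and one column per item."""
    # collect every item, whether it appears as an outer or an inner key
    items = sorted(set(d) | {k for inner in d.values() for k in inner})
    # orient='index' makes the outer keys the rows
    df = pd.DataFrame.from_dict(d, orient='index')
    # force the same sorted labels on both axes; missing cells become NaN
    df = df.reindex(index=items, columns=items)
    a = df.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(a, 1)
    a[np.isnan(a)] = 0
    return items, a


# itemC never appears as an outer key here
example = {
    'itemA': {'itemB': 0.85, 'itemC': 0.12},
    'itemB': {'itemA': 0.68},
}
items, matrix = dict_to_square_matrix(example)
```

The returned `items` list gives the row and column order, so it can be reused later as axis labels.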
To visualize this matrix:
import matplotlib.pyplot as plt
plt.matshow(a)
plt.show()
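Note that matshow labels the axes with integer row and column indices by default, not with the item names; the names (df.columns, in the same order as the rows of a) have to be set as tick labels explicitly.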
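Since the real dictionary has many more than 3 items, it probably helps to put the item names on both axes. A rough sketch, assuming the matrix and the label list are already in the same order (the function name `plot_labeled_matrix` and the fixed 0 to 1 colour range are choices made for this example):

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_labeled_matrix(a, labels):
    """Draw matrix `a` as a heatmap with `labels` on both axes."""
    fig, ax = plt.subplots()
    # fix the colour scale to [0, 1] so different plots are comparable
    im = ax.matshow(a, vmin=0, vmax=1)
    positions = np.arange(len(labels))
    # one tick per row/column, labelled with the item name
    ax.set_xticks(positions)
    ax.set_yticks(positions)
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticklabels(labels)
    # colour bar acts as the legend for the cell values
    fig.colorbar(im, ax=ax)
    return fig, ax


labels = ['itemA', 'itemB', 'itemC']
a = np.array([[1.0, 0.85, 0.12],
              [0.68, 1.0, 0.24],
              [0.28, 0.0, 1.0]])
fig, ax = plot_labeled_matrix(a, labels)
```

Call `plt.show()` afterwards to display the figure, or `fig.savefig(...)` to write it to a file.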
Upvotes: 1
Reputation: 3745
Here is code that builds the matrix as a plain list of lists, which you can easily adapt to whatever sequence type you want to use.
dictionary = {
    'itemA': {
        'itemB': 0.85,
        'itemC': 0.12
    },
    'itemB': {
        'itemA': 0.68,
        'itemC': 0.24
    },
    'itemC': {
        'itemA': 0.28
    }
}
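# fix the item order once so that rows and columns line up
# (assumes every item appears as an outer key)
items = sorted(dictionary)
matrix = []
for row in items:
    tmp_mat = []
    for col in items:
        if row == col:
            # diagonal: an item compared with itself
            tmp_mat.append(1)
        else:
            # look the value up by key; missing pairs default to 0
            tmp_mat.append(dictionary[row].get(col, 0))
    matrix.append(tmp_mat)
print(matrix)
[[1, 0.85, 0.12], [0.68, 1, 0.24], [0.28, 0, 1]]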
Upvotes: 0