Reputation: 598
I cluster dataframe indices based on a similarity matrix multiple times (trials) and store the cluster assignments in a dataframe as follows:
          trial 0   trial 1   trial 2   trial 3
index 0         0         1         0         0
index 1         0         1         1         0
index 2         2         0         2         0
index 3         1         2         2         1
Noise is added to the similarity matrix before each trial, so the cluster assignments are non-deterministic (hence the differences between trials). To be clear: each column ("trial") corresponds to a full clustering run, and each value is the cluster label assigned to that index in that run.
In the example above, index 0 and index 1 co-occur in the same cluster three times (in trials 0, 1 and 3).
What I want is a co-occurrence matrix like so:
          index 0   index 1   index 2   index 3
index 0         4         3         1         0
index 1         3         4         1         0
index 2         1         1         4         1
index 3         0         0         1         4
where each value is the number of trials in which the two indices were assigned to the same cluster (so the diagonal equals the number of trials).
Is there an efficient way to do this in Pandas? I can manage it with loops easily enough, but my trials dataframe has several thousand indices and trials.
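For reference, the straightforward nested-loop version alluded to above might look like the following sketch (the `trials` DataFrame is reconstructed from the example; the variable names are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Example trials DataFrame from the question.
trials = pd.DataFrame([[0, 1, 0, 0],
                       [0, 1, 1, 0],
                       [2, 0, 2, 0],
                       [1, 2, 2, 1]],
                      columns=['trial 0', 'trial 1', 'trial 2', 'trial 3'])

n = len(trials)
cooc = np.zeros((n, n), dtype=int)
# For every pair of indices, count the trials in which they share a cluster label.
for i in range(n):
    for j in range(n):
        cooc[i, j] = (trials.iloc[i] == trials.iloc[j]).sum()

res = pd.DataFrame(cooc, index=trials.index, columns=trials.index)
```

This is O(n² · trials) with Python-level loops over pairs, which is what becomes too slow at several thousand indices.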
Upvotes: 2
Views: 714
Reputation: 598
I figured out how to do it with a bit of linear algebra.
First, the trials matrix is decomposed into a sum with one component per cluster label (the labels should start at 1 so the method can be written down mathematically, though this isn't necessary in the implementation).
That is:
          trial 0   trial 1   trial 2   trial 3
index 0         0         1         0         0
index 1         0         1         1         0
index 2         2         0         2         0
index 3         1         2         2         1
becomes
          trial 0   trial 1   trial 2   trial 3
index 0         1         2         1         1
index 1         1         2         2         1
index 2         3         1         3         1
index 3         2         3         3         2
(incremented by one), which is decomposed as follows:
T = 1 * [1 0 1 1]   +   2 * [0 1 0 0]   +   3 * [0 0 0 0]
        [1 0 0 1]           [0 1 1 0]           [0 0 0 0]
        [0 1 0 1]           [0 0 0 0]           [1 0 1 0]
        [0 0 0 0]           [1 0 0 1]           [0 1 1 0]
Then each component matrix is normalized by its cluster number (recovering a binary membership matrix), multiplied by its own transpose, and the products are summed:

(C1/1)*(C1/1).T + (C2/2)*(C2/2).T + (C3/3)*(C3/3).T

where Ci is the component of T corresponding to cluster number i, so Ci/i is a 0/1 matrix whose (index, trial) entry is 1 exactly when that index belongs to cluster i in that trial. This sum is the resultant co-occurrence matrix. Below is the implementation and the result for the above example:
import numpy as np
import pandas as pd

test = pd.DataFrame(np.array([[0, 1, 0, 0],
                              [0, 1, 1, 0],
                              [2, 0, 2, 0],
                              [1, 2, 2, 1]]),
                    columns=['trial 0', 'trial 1', 'trial 2', 'trial 3'])
test_val = test.values

# Base matrix that the per-cluster products will be added to.
curr_mat = np.zeros((test_val.shape[0], test_val.shape[0]))

# Largest cluster label (i.e. max_val + 1 is the number of clusters/components).
max_val = np.max(test_val)

for n_clus in range(max_val + 1):
    # Binary membership matrix for the current cluster label.
    clus_mem = (test_val == n_clus).astype(int)
    curr_mat += np.dot(clus_mem, clus_mem.T)

res = pd.DataFrame(curr_mat, index=test.index, columns=test.index)
With the result:
          index 0   index 1   index 2   index 3
index 0         4         3         1         0
index 1         3         4         1         0
index 2         1         1         4         1
index 3         0         0         1         4
Unfortunately I had to use a `for` loop, but the number of iterations is now only the number of clusters, and the heavy lifting is done by numpy's efficient array operations.
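As an aside, the loop over clusters can be removed entirely with broadcasting, at the cost of an intermediate boolean array of shape (n, n, trials). This is a sketch of that variant, not the original method above:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame([[0, 1, 0, 0],
                     [0, 1, 1, 0],
                     [2, 0, 2, 0],
                     [1, 2, 2, 1]],
                    columns=['trial 0', 'trial 1', 'trial 2', 'trial 3'])
vals = test.values

# Compare every pair of rows across all trials at once, then count the matches.
cooc = (vals[:, None, :] == vals[None, :, :]).sum(axis=-1)
res = pd.DataFrame(cooc, index=test.index, columns=test.index)
```

For several thousand indices the intermediate array can get large, so the per-cluster loop may still be the better trade-off memory-wise.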
Upvotes: 1
Reputation: 11602
Here is a solution that requires looping over the columns only:

res = sum(df[c].apply(lambda x: x == df[c]) for c in df.columns)

For each column, `apply` with a Series-returning function produces the full pairwise equality matrix as a DataFrame, and summing over the columns counts the matches. However, if your data is fairly sparse, using loops or graphs may end up being faster.
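For the sparse case, one option (a sketch using `scipy.sparse`, not part of the answer above) is to give every (trial, cluster) pair its own column, one-hot encode membership, and take a single sparse matrix product:

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame([[0, 1, 0, 0],
                   [0, 1, 1, 0],
                   [2, 0, 2, 0],
                   [1, 2, 2, 1]])
vals = df.values
n, t = vals.shape
k = vals.max() + 1  # number of distinct cluster labels

# Column trial*k + label is 1 iff the index has that label in that trial.
cols = (np.arange(t) * k + vals).ravel()
rows = np.repeat(np.arange(n), t)
member = sparse.csr_matrix((np.ones(n * t), (rows, cols)), shape=(n, t * k))

# member @ member.T counts the trials in which two indices share a cluster.
res = pd.DataFrame((member @ member.T).toarray().astype(int),
                   index=df.index, columns=df.index)
```

The product never materializes a dense (n, n, trials) intermediate, which is where the sparse representation pays off.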
Upvotes: 1