duncster94

Reputation: 598

Pandas: Counting index pairwise occurrences of same value in a dataframe

I cluster dataframe indices based on a similarity matrix multiple times (trials) and store the cluster assignments in a dataframe as follows:

        trial 0  trial 1  trial 2  trial 3
index 0    0        1        0        0
index 1    0        1        1        0
index 2    2        0        2        0
index 3    1        2        2        1

Noise is added to the similarity matrix before each trial, so the cluster assignments are non-deterministic (hence the differences between trials). To be clear: each column corresponds to one full clustering run, and each value is the cluster label an index received in that trial.

In the example above index 0 and index 1 co-occur in the same cluster three times.

What I want is a co-occurrence matrix like so:

        index 0  index 1  index 2  index 3
index 0    4        3        1        0   
index 1    3        4        1        0
index 2    1        1        4        1
index 3    0        0        1        4

Here each value is the number of trials in which the two indices were assigned to the same cluster.

Is there an efficient way to do this in Pandas? I can manage it with loops easily enough but my trials dataframe has several thousand indices and trials.
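For reference, the brute-force loop approach alluded to above might look like this (a sketch; the `trials` name and explicit labels are illustrative):

```python
import numpy as np
import pandas as pd

# Example trials dataframe from the question.
trials = pd.DataFrame(
    [[0, 1, 0, 0],
     [0, 1, 1, 0],
     [2, 0, 2, 0],
     [1, 2, 2, 1]],
    index=[f'index {i}' for i in range(4)],
    columns=[f'trial {i}' for i in range(4)],
)

# Naive O(n^2 * n_trials) loop: for every pair of indices, count the
# trials in which both received the same cluster label.
n = len(trials)
cooc = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        cooc[i, j] = (trials.iloc[i] == trials.iloc[j]).sum()

res = pd.DataFrame(cooc, index=trials.index, columns=trials.index)
```

This reproduces the matrix above but scales quadratically in the number of indices, which is the motivation for the vectorized answers below.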

Upvotes: 2

Views: 714

Answers (2)

duncster94

Reputation: 598

I figured out how to do it with a bit of linear algebra.

First, the trials matrix is decomposed into a sum of indicator matrices, one per cluster label (for the mathematical formulation the labels should start at 1, though this isn't necessary in the implementation).

That is:

        trial 0  trial 1  trial 2  trial 3
index 0    0        1        0        0
index 1    0        1        1        0
index 2    2        0        2        0
index 3    1        2        2        1

becomes

        trial 0  trial 1  trial 2  trial 3
index 0    1        2        1        1
index 1    1        2        2        1
index 2    3        1        3        1
index 3    2        3        3        2

(incremented by one), which is decomposed as follows:

T = 1  0  1  1  +  2 * 0  1  0  0  + 3 * 0  0  0  0
    1  0  0  1         0  1  1  0        0  0  0  0
    0  1  0  1         0  0  0  0        1  0  1  0
    0  0  0  0         1  0  0  1        0  1  1  0

Then each component matrix is multiplied by its own transpose, and the products are summed:

C1*C1.T + C2*C2.T + C3*C3.T

where Ci is the indicator matrix for cluster label i, i.e. Ci has a 1 exactly where the incremented matrix T equals i. The (j, k) entry of Ci*Ci.T counts the trials in which indices j and k both belong to cluster i, so the sum over i gives the total co-occurrence counts.

This sum is then the resultant co-occurrence matrix. Below is the implementation and the result for the above example:

import numpy as np
import pandas as pd

test = pd.DataFrame(np.array([[0, 1, 0, 0],
                              [0, 1, 1, 0],
                              [2, 0, 2, 0],
                              [1, 2, 2, 1]]),
                    columns=['trial 0', 'trial 1', 'trial 2', 'trial 3'])
test_val = test.values

# Base matrix that the per-cluster products are accumulated into.
curr_mat = np.zeros((test_val.shape[0], test_val.shape[0]))

# Largest cluster label (so max_val + 1 is the number of clusters/components).
max_val = np.max(test_val)

for n_clus in range(max_val + 1):

    # Indicator matrix for the current cluster label.
    clus_mem = (test_val == n_clus).astype(int)
    curr_mat += np.dot(clus_mem, clus_mem.T)

res = pd.DataFrame(curr_mat, index=test.index, columns=test.index)

With the result:

         index 0  index 1  index 2  index 3
index 0     4        3        1        0
index 1     3        4        1        0
index 2     1        1        4        1
index 3     0        0        1        4

A for loop is still needed, but the number of iterations is only the number of clusters, and the heavy lifting is done by numpy's efficient array operations.
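The loop can also be eliminated entirely with a single broadcasted comparison (a sketch, not part of the original answer; note the intermediate boolean array uses O(n² × trials) memory):

```python
import numpy as np
import pandas as pd

test = pd.DataFrame([[0, 1, 0, 0],
                     [0, 1, 1, 0],
                     [2, 0, 2, 0],
                     [1, 2, 2, 1]],
                    columns=['trial 0', 'trial 1', 'trial 2', 'trial 3'])
vals = test.values

# Compare every pair of rows across all trials in one broadcasted step:
# (n, 1, trials) == (1, n, trials) -> boolean array of shape (n, n, trials),
# then sum over the trials axis to get the co-occurrence counts.
cooc = (vals[:, None, :] == vals[None, :, :]).sum(axis=2)

res = pd.DataFrame(cooc, index=test.index, columns=test.index)
```

This gives the same result as the per-cluster decomposition, since two indices agree in a trial exactly when they fall in the same cluster.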

Upvotes: 1

hilberts_drinking_problem

Reputation: 11602

Here is a solution that requires looping over columns only (with pandas and numpy imported as pd and np; each term is the pairwise equality matrix for one trial):

res = sum(np.equal.outer(df[c].to_numpy(), df[c].to_numpy()) for c in df.columns)
res = pd.DataFrame(res, index=df.index, columns=df.index)

However, if your data is fairly sparse, using loops or graphs may end up being faster.
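For sparse data, the one-hot/matrix-product idea above can be carried out with `scipy.sparse` (a sketch, assuming `scipy` is available; the flattened (trial, cluster) column encoding is an illustrative choice):

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame([[0, 1, 0, 0],
                   [0, 1, 1, 0],
                   [2, 0, 2, 0],
                   [1, 2, 2, 1]],
                  columns=['trial 0', 'trial 1', 'trial 2', 'trial 3'])
vals = df.values
n_rows, n_trials = vals.shape
n_clusters = vals.max() + 1

# One-hot encode each (trial, cluster) pair: row i gets a 1 in column
# (trial * n_clusters + cluster) when index i is in that cluster in that trial.
cols = np.arange(n_trials) * n_clusters + vals      # shape (n_rows, n_trials)
rows = np.repeat(np.arange(n_rows), n_trials)
onehot = sparse.csr_matrix(
    (np.ones(vals.size), (rows, cols.ravel())),
    shape=(n_rows, n_trials * n_clusters),
)

# The sparse product counts, for each pair of indices, the trials
# in which they share a cluster.
cooc = (onehot @ onehot.T).toarray().astype(int)
res = pd.DataFrame(cooc, index=df.index, columns=df.index)
```

When most index pairs never co-occur, the product stays sparse and both time and memory can beat the dense approaches.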

Upvotes: 1
