Pairwise matrix from a pandas dataframe

Question

I have a pandas dataframe that looks something like this:

             Al01   BBR60   CA07    NL219
AAEAMEVAT    MP      NaN     MP      MP 
AAFEDLRLL    NaN     NaN     NaN     NaN
AAGAAVKGV    NP      NaN     NP      NP 
ADRGLLRDI    NaN     NP      NaN     NaN 
AEIMKICST    PB1     NaN     NaN     PB1 
AFDERRAGK    NaN     NaN     NP      NP 
AFDERRAGK    NP      NaN     NaN     NaN

There are a thousand or so rows and half a dozen columns. Most cells are empty (NaN). For each pair of columns, I would like to know how likely one column is to contain text, given that the other column contains text. For example, the little snippet here would produce something like this table of co-occurrence counts:

            Al01    BBR60   CA07    NL219
Al01        4       0       2       3
BBR60       0       1       0       0
CA07        2       0       3       3
NL219       3       0       3       4

That says that there are 4 hits in the Al01 column; of those 4 hits, none are hits in the BBR60 column, 2 are also hits in the CA07 column, and 3 are hits in the NL219 column. And so on.

I can step through each column and build a dict with the values, but that seems clumsy. Is there a simpler approach?

Alvaro Fuentes · Accepted Answer

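It's just matrix multiplication. Mark each cell 1 if it has text and 0 if it is NaN, then multiply the transpose of that 0/1 matrix by the matrix itself; entry (i, j) of the product counts the rows where columns i and j both have text: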

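import pandas as pd

# Read the whitespace-delimited table; the first column becomes the index
df = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
# 1 where a cell holds text, 0 where it is NaN
df2 = df.notnull().astype(int)
# Entry (i, j) counts the rows where columns i and j both hold text
print(df2.T.dot(df2))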

Output:

           Al01  BBR60  CA07  NL219
Al01      4      0     2      3
BBR60     0      1     0      0
CA07      2      0     3      3
NL219     3      0     3      4

[4 rows x 4 columns]
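
The question asks for probabilities, while the matrix above holds raw counts. As a minimal sketch (not part of the original answer), dividing each row of the count matrix by its diagonal entry turns the counts into estimated conditional probabilities. The sample frame below is typed in from the question rather than read from data.csv, the names hits, counts and cond_prob are illustrative, and the product is taken with NumPy on the assumption that index alignment is not needed here:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; None marks an empty cell.
df = pd.DataFrame(
    {
        "Al01":  ["MP", None, "NP", None, "PB1", None, "NP"],
        "BBR60": [None, None, None, "NP", None, None, None],
        "CA07":  ["MP", None, "NP", None, None, "NP", None],
        "NL219": ["MP", None, "NP", None, "PB1", "NP", None],
    },
    index=["AAEAMEVAT", "AAFEDLRLL", "AAGAAVKGV", "ADRGLLRDI",
           "AEIMKICST", "AFDERRAGK", "AFDERRAGK"],
)

# 0/1 indicator matrix: 1 where a cell has text, 0 where it is missing.
hits = df.notna().astype(int).to_numpy()

# counts.loc[a, b] = number of rows with text in both column a and column b.
counts = pd.DataFrame(hits.T @ hits, index=df.columns, columns=df.columns)

# The diagonal holds each column's own hit count.
totals = pd.Series(np.diag(counts.to_numpy()), index=counts.index)

# Row-wise division: cond_prob.loc[a, b] estimates P(text in b | text in a).
cond_prob = counts.div(totals, axis=0)

print(counts)
print(cond_prob)
```

Each row of cond_prob is conditioned on the column named in its index, so the matrix is generally not symmetric. A column with no hits at all would have a zero on the diagonal and its row would come out as NaN.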
