Reputation: 6709
I have a pandas dataframe that looks something like this:
Al01 BBR60 CA07 NL219 AAEAMEVAT MP NaN MP MP AAFEDLRLL NaN NaN NaN NaN AAGAAVKGV NP NaN NP NP ADRGLLRDI NaN NP NaN NaN AEIMKICST PB1 NaN NaN PB1 AFDERRAGK NaN NaN NP NP AFDERRAGK NP NaN NaN NaN
There are a thousand or so rows and half a dozen columns. Most cells are empty (NaN). I would like to know what the probability of text in each column is, given that a different column has text in it. For example, the little snippet here would produce something like this:
Al01 BBR60 CA07 NL219 Al01 4 0 2 3 BBR60 0 1 0 0 CA07 2 0 3 3 NL219 3 0 3 4
That says that there are 4 hits in the Al01 column; of those 4 hits, none are hits in the BBR60 column, 2 are also hits in the CA07 column, and 3 are hits in the NL219 column. And so on.
I can step through each column and build a dict with the values, but that seems clumsy. Is there a simpler approach?
Upvotes: 3
Views: 3648
Reputation: 880509
The operation you are performing can be expressed as an application of np.einsum
-- it's an inner product between each pair of columns:
import numpy as np
import pandas as pd
df = pd.read_table('data', sep='\s+')
print(df)
# Al01 BBR60 CA07 NL219
# 0 MP NaN MP MP
# 1 NaN NaN NaN NaN
# 2 NP NaN NP NP
# 3 NaN NP NaN NaN
# 4 PB1 NaN NaN PB1
# 5 NaN NaN NP NP
# 6 NP NaN NaN NaN
arr = (~df.isnull()).values.astype('int')
print(arr)
# [[1 0 1 1]
# [0 0 0 0]
# [1 0 1 1]
# [0 1 0 0]
# [1 0 0 1]
# [0 0 1 1]
# [1 0 0 0]]
result = pd.DataFrame(np.einsum('ij,ik', arr, arr),
columns=df.columns, index=df.columns)
print(result)
yields
Al01 BBR60 CA07 NL219
Al01 4 0 2 3
BBR60 0 1 0 0
CA07 2 0 3 3
NL219 3 0 3 4
Usually when a calculation boils down to a numeric operation independent of indices, it is faster to do it with NumPy than with Pandas. That appears to be the case here:
In [130]: %timeit df2 = df.applymap(lambda x: int(not pd.isnull(x))); df2.T.dot(df2)
1000 loops, best of 3: 1.12 ms per loop
In [132]: %timeit arr = (~df.isnull()).values.astype('int'); pd.DataFrame(np.einsum('ij,ik', arr, arr), columns=df.columns, index=df.columns)
10000 loops, best of 3: 132 µs per loop
Upvotes: 6
Reputation: 17475
It just matrix multiplication:
import pandas as pd
df = pd.read_csv('data.csv',index_col=0, delim_whitespace=True)
df2 = df.applymap(lambda x: int(not pd.isnull(x)))
print df2.T.dot(df2)
Output:
Al01 BBR60 CA07 NL219
Al01 4 0 2 3
BBR60 0 1 0 0
CA07 2 0 3 3
NL219 3 0 3 4
[4 rows x 4 columns]
Upvotes: 3