Reputation: 119
This is a continuation of my question. Fastest way to compare rows of two pandas dataframes?
I have two dataframes A
and B
:
A
is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
For a condensed example:
A B C D E
0 0 0 0 1 0
1 1 1 1 1 0
2 1 0 0 1 1
3 0 1 1 1 0
B
is 1024 rows x 10 columns, and is a full iteration from 0 to 1023 in binary form.
Example:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I am trying to find which rows in A
, at a particular 10 columns of A
, correspond with each row of B
.
Each row of A[My_Columns_List]
is guaranteed to be somewhere in B
, but not every row of B
will match up with a row in A[My_Columns_List]
For example, I want to show that for columns [B,D,E]
of A
,
rows [1,3]
of A match up with row [6]
of B
,
row [0]
of A matches up with row [2]
of B
,
row [2]
of A matches up with row [3]
of B
.
I have tried using:
pd.merge(B.reset_index(), A.reset_index(),
left_on = B.columns.tolist(),
right_on =A.columns[My_Columns_List].tolist(),
suffixes = ('_B','_A')))
This works, but I was hoping that this method would be faster:
S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
But when I do this, out_row_idx
returns an array containing all the indices of A
, which doesn't tell me anything.
I think this method will be faster, but I don't know why it returns an array from 0 to 999.
Any input would be appreciated!
Also, credit goes to @jezrael and @Divakar for these methods.
Upvotes: 4
Views: 1609
Reputation: 294218
I'll stick by my initial answer but maybe explain better.
You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.
You said we have a a 1000 x 500 array of ones and zeros. Let's build that.
A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1]/10), range(10)])
A = A_init
In addition, I gave A
a MultiIndex
to easily group by columns of 10.
This is very similar to @Divakar's answer with one minor difference that I'll point out.
For one group of 10 ones and zeros, we can treat it as a bit array of length 8. We can then calculate what it's integer value is by taking the dot product with an array of powers of 2.
twos = 2 ** np.arange(10)
I can execute this for every group of 10 ones and zeros in one go like this
AtB = A.stack(0).dot(twos).unstack()
I stack
to get a row of 50 groups of 10 into columns in order to do the dot product more elegantly. I then brought it back with the unstack
.
I now have a 1000 x 50 dataframe of numbers that range from 0-1023.
Assume B
is a dataframe with each row one of 1024 unique combinations of ones and zeros. B
should be sorted like B = B.sort_values().reset_index(drop=True)
.
This is the part I think I failed at explaining last time. Look at
AtB.loc[:2, :2]
That value in the (0, 0)
position, 951
means that the first group of 10 ones and zeros in the first row of A
matches the row in B
with the index 951
. That's what you want!!! Funny thing is, I never looked at B. You know why, B is irrelevant!!! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference with my answer, I'm ignoring B
. Ignoring this useless step should save time.
These are all functions that take two dataframes A
and B
and returns a dataframe of indices where A
matches B
. Spoiler alert, I'll ignore B
completely.
def FindAinB(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
rng = np.arange(A.shape[1])
A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])
twos = 2 ** np.arange(10)
return A.stack(0).dot(twos).unstack()
def FindAinB2(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
rng = np.arange(A.shape[1])
A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])
# use clever bit shifting instead of dot product with powers
# questionable improvement
return (A.stack(0) << np.arange(10)).sum(1).unstack()
I'm channelling my inner @Divakar (read, this is stuff I've learned from Divakar)
def FindAinB3(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
a = A.values.reshape(-1, 10)
a = np.einsum('ij->i', a << np.arange(10))
return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)
f = lambda A: pd.DataFrame(np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1), A.index)
Use it like
f(A)
FindAinB3 is an order of magnitude faster
Upvotes: 6