Garrett Miller
Garrett Miller

Reputation: 119

Comparing rows of two pandas dataframes?

This is a continuation of my question. Fastest way to compare rows of two pandas dataframes?

I have two dataframes A and B:

A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.

For a condensed example:

    A   B   C   D   E  
0   0   0   0   1   0  
1   1   1   1   1   0  
2   1   0   0   1   1  
3   0   1   1   1   0  

B is 1024 rows x 10 columns, and is a full iteration from 0 to 1023 in binary form.

Example:

     0   1   2  
0    0   0   0  
1    0   0   1   
2    0   1   0  
3    0   1   1  
4    1   0   0
5    1   0   1
6    1   1   0 
7    1   1   1

I am trying to find which rows in A, at a particular 10 columns of A, correspond with each row of B.

Each row of A[My_Columns_List] is guaranteed to be somewhere in B, but not every row of B will match up with a row in A[My_Columns_List]

For example, I want to show that for columns [B,D,E] of A,

rows [1,3] of A match up with row [6] of B,

row [0] of A matches up with row [2] of B,

row [2] of A matches up with row [3] of B.

I have tried using:

pd.merge(B.reset_index(), A.reset_index(),
left_on = B.columns.tolist(),
right_on =A.columns[My_Columns_List].tolist(), 
suffixes = ('_B','_A')))

This works, but I was hoping that this method would be faster:

S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]

But when I do this, out_row_idx returns an array containing all the indices of A, which doesn't tell me anything. I think this method will be faster, but I don't know why it returns an array from 0 to 999. Any input would be appreciated!

Also, credit goes to @jezrael and @Divakar for these methods.

Upvotes: 4

Views: 1609

Answers (1)

piRSquared
piRSquared

Reputation: 294218

I'll stick by my initial answer but maybe explain better.

You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.

Setup

You said we have a a 1000 x 500 array of ones and zeros. Let's build that.

A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1]/10), range(10)])
A = A_init

In addition, I gave A a MultiIndex to easily group by columns of 10.

Solution

This is very similar to @Divakar's answer with one minor difference that I'll point out.

For one group of 10 ones and zeros, we can treat it as a bit array of length 8. We can then calculate what it's integer value is by taking the dot product with an array of powers of 2.

twos = 2 ** np.arange(10)

I can execute this for every group of 10 ones and zeros in one go like this

AtB = A.stack(0).dot(twos).unstack()

I stack to get a row of 50 groups of 10 into columns in order to do the dot product more elegantly. I then brought it back with the unstack.

I now have a 1000 x 50 dataframe of numbers that range from 0-1023.

Assume B is a dataframe with each row one of 1024 unique combinations of ones and zeros. B should be sorted like B = B.sort_values().reset_index(drop=True).

This is the part I think I failed at explaining last time. Look at

AtB.loc[:2, :2]

enter image description here

That value in the (0, 0) position, 951 means that the first group of 10 ones and zeros in the first row of A matches the row in B with the index 951. That's what you want!!! Funny thing is, I never looked at B. You know why, B is irrelevant!!! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference with my answer, I'm ignoring B. Ignoring this useless step should save time.

These are all functions that take two dataframes A and B and returns a dataframe of indices where A matches B. Spoiler alert, I'll ignore B completely.

def FindAinB(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    rng = np.arange(A.shape[1])
    A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])

    twos = 2 ** np.arange(10)

    return A.stack(0).dot(twos).unstack()

def FindAinB2(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    rng = np.arange(A.shape[1])
    A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])
    # use clever bit shifting instead of dot product with powers
    # questionable improvement
    return (A.stack(0) << np.arange(10)).sum(1).unstack()

I'm channelling my inner @Divakar (read, this is stuff I've learned from Divakar)

def FindAinB3(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    a = A.values.reshape(-1, 10)
    a = np.einsum('ij->i', a << np.arange(10))
    return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)

Minimalist One Liner

f = lambda A: pd.DataFrame(np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1), A.index)

Use it like

f(A)

Timing

FindAinB3 is an order of magnitude faster

enter image description here

Upvotes: 6

Related Questions