user3276418

Reputation: 1807

Filtering a pandas DataFrame using vectorization

I have a dataframe with x rows and y columns, called df. I have another dataframe, df2, with fewer than x rows and y-1 columns. I want to filter df for rows that are identical to the rows of df2 in columns 1 to y-1. Is there a way to do that in a vectorized fashion, without iterating through the rows of df2?

Here is the code for a sample df:

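import pandas
import numpy.random as rd

dates = pandas.date_range('1/1/2000', periods=8)
df = pandas.DataFrame(rd.randn(8, 5), index=dates,
                      columns=['call/put', 'expiration', 'strike', 'ask', 'bid'])
# Rows 2 and 3 get a zero ask and bid; row 2 also gets strike 0.5
df.iloc[2, 4] = 0
df.iloc[2, 3] = 0
df.iloc[3, 4] = 0
df.iloc[3, 3] = 0
df.iloc[2, 2] = 0.5
# Append a copy of row 2 (DataFrame.append was removed in pandas 2.0, so use concat)
df = pandas.concat([df, df.iloc[2:3]])
df.iloc[8:9, 3:5] = 1
df.iloc[8:9, 2:3] = 0.6
# Append a copy of row 8 and set its strike to 0.4
df = pandas.concat([df, df.iloc[8:9]])
df.iloc[9, 2] = 0.4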

df2 (named df4 in the code below) is calculated as follows:

df4=df[(df["ask"]==0) & (df["bid"]==0)]

Now I want to filter df for rows that look like those in df2 except for the strike column, which should have a value of 0.4. The filtering should be done without iteration, because my real-world df is very large.

Upvotes: 1

Views: 464

Answers (1)

Thomas Kimber

Reputation: 11097

You could try a merge on both dataframes, which should return the (set) intersection of the two.

pandas.merge(df, df2, on=['call/put', 'expiration', 'strike', 'ask'], left_index=True, right_index=True)

            call/put  expiration    strike  ask  bid_x  bid_y
2000-01-03  0.614738   -0.363933  0.500000    0      0      0
2000-01-03  0.614738   -0.363933  0.600000    1      1      0
2000-01-03  0.614738   -0.363933  0.400000    1      1      0
2000-01-04  1.077427   -1.046127  0.025931    0      0      0

I renamed your df4 to df2. The dataframe returned above should be the complete list of records from df that match the records in your "whitelist" held in df2, based on the columns listed in the statement above.
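
As a rough sketch of the whitelist idea on made-up data: the values below and the choice of 'call/put' and 'expiration' as key columns are my assumptions, not something taken from your frame. Keeping only the key columns on the right-hand side, with duplicates dropped, makes the inner merge behave like a filter, so no _x/_y suffix columns appear and no row of df is repeated.

```python
import pandas as pd

# Toy data with hypothetical values (not the random frame from the question)
df = pd.DataFrame({
    "call/put":   [1, 1, 0, 0],
    "expiration": [10, 10, 20, 30],
    "strike":     [0.5, 0.4, 0.7, 0.9],
    "ask":        [0.0, 1.0, 0.0, 2.0],
    "bid":        [0.0, 1.0, 0.0, 2.0],
})
df2 = df[(df["ask"] == 0) & (df["bid"] == 0)]

# Assumed key columns; adjust to whatever should identify a matching row
keys = ["call/put", "expiration"]

# Right side holds only the keys, deduplicated, so the merge acts as a filter
whitelist = df2[keys].drop_duplicates()
matches = df.merge(whitelist, on=keys, how="inner")
print(matches)
```

Here matches keeps the three rows of df whose key pair also occurs in df2 and drops the row with expiration 30.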

A slightly different statement, which drops 'strike' from the columns to be matched on and adds 'bid', returns:

pandas.merge(df, df2, on=['call/put', 'expiration', 'ask', 'bid'], left_index=True, right_index=True, how='inner')

            call/put  expiration  strike_x  ask  bid  strike_y
2000-01-03  0.614738   -0.363933  0.500000    0    0  0.500000
2000-01-03  0.614738   -0.363933  0.600000    1    1  0.500000
2000-01-03  0.614738   -0.363933  0.400000    1    1  0.500000
2000-01-04  1.077427   -1.046127  0.025931    0    0  0.025931

That's still not quite right. I think it's because of the left_index=True and right_index=True arguments. To force it, you can convert the date indices into regular columns and include them as match columns.

e.g.

df['date'] = df.index
df2['date'] = df2.index

And then

pandas.merge(df, df2, on=['call/put', 'expiration', 'ask', 'bid', 'date'], how='inner')

Returns:

    call/put  expiration  strike_x  ask  bid                date  strike_y
 0  0.367269   -0.616125   0.50000    0    0 2000-01-03 00:00:00   0.50000
 1 -0.508974    0.281017   0.65791    0    0 2000-01-04 00:00:00   0.65791

Which I think more closely matches what you're looking for.
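
For the "same as df2 but with strike 0.4" part of the question, something along these lines might work. It is only a sketch on made-up data, and it assumes that a row "looks like" a df2 row when the date, 'call/put' and 'expiration' agree; ask and bid are deliberately left out of the match, which may or may not be what you want.

```python
import pandas as pd

# Toy data with hypothetical values, indexed by date as in the question
dates = pd.to_datetime(
    ["2000-01-03", "2000-01-04", "2000-01-03", "2000-01-03", "2000-01-05"]
)
df = pd.DataFrame(
    {
        "call/put":   [0.6, 1.1, 0.6, 0.6, 0.2],
        "expiration": [-0.4, -1.0, -0.4, -0.4, 0.3],
        "strike":     [0.5, 0.1, 0.6, 0.4, 0.4],
        "ask":        [0.0, 0.0, 1.0, 1.0, 1.0],
        "bid":        [0.0, 0.0, 1.0, 1.0, 1.0],
    },
    index=dates,
)
df.index.name = "date"
df2 = df[(df["ask"] == 0) & (df["bid"] == 0)]

# Assumed match columns (the date index is turned into a regular column first)
keys = ["date", "call/put", "expiration"]
whitelist = df2.reset_index()[keys].drop_duplicates()

# Restrict df to strike 0.4 first, then keep only rows whose keys are in df2
candidates = df[df["strike"] == 0.4].reset_index()
result = candidates.merge(whitelist, on=keys, how="inner").set_index("date")
print(result)
```

With this toy data, result holds the single 2000-01-03 row with strike 0.4. Comparing floats with == is fine for values you assigned yourself, but for computed strikes a tolerance check such as numpy.isclose is probably safer.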

Upvotes: 1
