Reputation: 1807
I have a data frame with x rows and y columns, called df. I have another data frame, df2, with fewer than x rows and y-1 columns. I want to filter df for rows that are identical to the rows of df2 in columns 1 to y-1. Is there a way to do that in a vectorized fashion, without iterating through the rows of df2?
Here is the code for a sample df:
import pandas
import numpy.random as rd
dates = pandas.date_range('1/1/2000', periods=8)
df = pandas.DataFrame(rd.randn(8, 5), index=dates, columns=['call/put', 'expiration', 'strike', 'ask', 'bid'])
df.iloc[2,4]=0
df.iloc[2,3]=0
df.iloc[3,4]=0
df.iloc[3,3]=0
df.iloc[2,2]=0.5
df = pandas.concat([df, df.iloc[2:3]])  # DataFrame.append was removed in pandas 2.0
df.iloc[8:9,3:5]=1
df.iloc[8:9,2:3]=0.6
df = pandas.concat([df, df.iloc[8:9]])  # DataFrame.append was removed in pandas 2.0
df.iloc[9,2]=0.4
df2 (named df4 in the line below) is calculated as follows:
df4=df[(df["ask"]==0) & (df["bid"]==0)]
Now I want to filter df for rows that look like those in df2 except for the strike column, which should have a value of 0.4. The filtering should be done without iteration, because my real-world df is very large.
Upvotes: 1
Views: 464
Reputation: 11097
You can try a merge on both dataframes, which should return the (set) intersection of both.
pandas.merge(df, df2, on=['call/put', 'expiration', 'strike', 'ask'], left_index=True, right_index=True)
            call/put  expiration    strike  ask  bid_x  bid_y
2000-01-03  0.614738   -0.363933  0.500000    0      0      0
2000-01-03  0.614738   -0.363933  0.600000    1      1      0
2000-01-03  0.614738   -0.363933  0.400000    1      1      0
2000-01-04  1.077427   -1.046127  0.025931    0      0      0
I renamed your df4 to df2. The dataframe returned above should be the complete list of records from df that match the records in your "whitelist", df2, based on the columns listed in the statement above.
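To make the "intersection" idea concrete, here is a minimal sketch on made-up data. The frames `left` and `whitelist` and all their values are hypothetical and are not the data from the question; the sketch only assumes that matching should happen on a fixed list of key columns.

```python
import pandas as pd

# Hypothetical data: `left` plays the role of df, `whitelist` the role of df2.
left = pd.DataFrame({
    "call/put": [1, 1, 2, 3],
    "expiration": [10, 10, 20, 30],
    "strike": [0.5, 0.4, 0.7, 0.9],
})
whitelist = pd.DataFrame({
    "call/put": [1, 3],
    "expiration": [10, 30],
})

# Columns to match on (an assumption made for this illustration).
keys = ["call/put", "expiration"]

# An inner merge keeps only the rows of `left` whose key values also occur
# in `whitelist`. Dropping duplicate key rows first avoids repeating matches.
matched = left.merge(whitelist[keys].drop_duplicates(), on=keys, how="inner")
print(matched)
```

No Python-level loop over the whitelist rows is involved; the matching is done inside the merge.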
A slightly different statement, which drops 'strike' from the columns to be matched on and adds 'bid' to them, returns:
pandas.merge(df, df2, on=['call/put', 'expiration', 'ask', 'bid'], left_index=True, right_index=True, how='inner')
            call/put  expiration  strike_x  ask  bid  strike_y
2000-01-03  0.614738   -0.363933  0.500000    0    0  0.500000
2000-01-03  0.614738   -0.363933  0.600000    1    1  0.500000
2000-01-03  0.614738   -0.363933  0.400000    1    1  0.500000
2000-01-04  1.077427   -1.046127  0.025931    0    0  0.025931
That's still not quite right. I think it's because of the left_index=True and right_index=True arguments. To force the match, you can convert the date indices into regular columns and include them as match columns.
e.g.
df['date'] = df.index
df2['date'] = df2.index
And then
pandas.merge(df, df2, on=['call/put', 'expiration', 'ask', 'bid', 'date'], how='inner')
Returns:
   call/put  expiration  strike_x  ask  bid                 date  strike_y
0  0.367269   -0.616125   0.50000    0    0  2000-01-03 00:00:00   0.50000
1 -0.508974    0.281017   0.65791    0    0  2000-01-04 00:00:00   0.65791
Which I think more closely matches what you're looking for.
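Putting the pieces together, one possible end-to-end version might look like the sketch below. It runs on a small made-up frame rather than the random data from the question, and the choice of key columns (`date`, `call/put`, `expiration`) is an assumption about what "look like those in df2" should mean; adjust `keys` to whichever columns need to agree.

```python
import pandas as pd

# Hypothetical stand-in for the question's df (values are made up).
dates = pd.to_datetime(["2000-01-03", "2000-01-03", "2000-01-04", "2000-01-05"])
df = pd.DataFrame(
    {
        "call/put": [0.61, 0.61, 1.07, -0.5],
        "expiration": [-0.36, -0.36, -1.04, 0.28],
        "strike": [0.5, 0.4, 0.02, 0.4],
        "ask": [0.0, 1.0, 0.0, 1.0],
        "bid": [0.0, 1.0, 0.0, 1.0],
    },
    index=dates,
)

# Same rule as in the question: the "whitelist" rows have ask == 0 and bid == 0.
df2 = df[(df["ask"] == 0) & (df["bid"] == 0)]

# Assumed match columns; 'strike', 'ask' and 'bid' are deliberately left out.
keys = ["date", "call/put", "expiration"]

# Turn the date index into an ordinary column so it can be used as a merge key.
left = df.rename_axis("date").reset_index()
right = df2.rename_axis("date").reset_index()[keys].drop_duplicates()

# Inner merge keeps the rows of df whose key columns occur in df2,
# then a boolean mask selects the desired strike.
matched = left.merge(right, on=keys, how="inner")
result = matched[matched["strike"] == 0.4].set_index("date")
print(result)
```

With real floating-point data, comparing `strike` through `numpy.isclose` may be more robust than `==`.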
Upvotes: 1