Reputation: 455
I have the following data frame
df = pd.DataFrame([[1990,7,1000],[1990,8,2500],[1990,9,2500],[1990,9,1500],[1991,1,250],[1991,2,350],[1991,3,350],[1991,7,450]], columns = ['year','month','data1'])
year month data1
1990 7 1000
1990 8 2500
1990 9 2500
1990 9 1500
1991 1 250
1991 2 350
1991 3 350
1991 7 450
I would like to filter the data such that it won't contain data with month/year 07/1990, 08/1990 and 01/1991. I can do for each combination month/year as follow:
df = df.loc[(df.year != 1990) | (df.month != 7)]
But it is not efficient if there are many combinations month/year. Is there any more efficient way of doing this?
Many thanks.
Upvotes: 11
Views: 2247
Reputation: 28729
Set the year
and month
as indices, filter on the indices with isin
and a callable
and reset the index:
(
df.set_index(["year", "month"])
.loc[lambda x: ~x.index.isin([(1990, 7), (1990, 8), (1991, 1)])]
.reset_index()
)
year month data1
0 1990 9 2500
1 1990 9 1500
2 1991 2 350
3 1991 3 350
4 1991 7 450
Alternatively, you can use the drop
function to drop the rows you are not interested in, and reset the index:
(df.set_index(["year", "month"], drop = False)
.drop([(1990, 7), (1990, 8), (1991, 1)])
.reset_index(drop = True)
)
Upvotes: 2
Reputation: 784
[EDIT] just iterate over your df and get all indexes to delete:
dates_to_avoid = [(1990, 7), (1990, 8), (1991, 1)])
index_to_delete = [row.index for row in df.itertuples() if (row.year, row.month) in dates_to_avoid]
Then:
df = df.loc[~df.index.isin(index_to_delete)]
Here the keyword ~
will avoid all values from the list.
Upvotes: 2
Reputation: 26311
Even faster (roughly 3x than the elegant version of @DaniMesejo applying tuple
). But also it relies on the knowledge that months are bounded to (well below) 100, so less generalizable:
mask = ~(df.year*100 + df.month).isin({199007, 199008, 199101})
df[mask]
# out:
year month data1
2 1990 9 2500
3 1990 9 1500
5 1991 2 350
6 1991 3 350
7 1991 7 450
How come this is 3x faster than the tuples solution? (Tricks for speed):
apply
..isin()
with a set as argument (not a list).Upvotes: 10
Reputation: 323396
Let us try merge
out = df.drop(df.reset_index().merge(pd.DataFrame({'year':[1990,1990,1991],'month':[7,8,1]}))['index'])
year month data1
2 1990 9 2500
3 1990 9 1500
5 1991 2 350
6 1991 3 350
7 1991 7 450
And small improvement
out = df.merge(pd.DataFrame({'year':[1990,1990,1991],'month':[7,8,1]}),indicator=True,how='left').loc[lambda x : x['_merge']=='left_only']
year month data1 _merge
2 1990 9 2500 left_only
3 1990 9 1500 left_only
5 1991 2 350 left_only
6 1991 3 350 left_only
7 1991 7 450 left_only
Based on my test this should be fast than apply tuple method ~
Upvotes: 3
Reputation: 2706
You can add a value for yyyymm and then use this to remove the data you want.
df['yyyymm'] = df['year'].astype(str) + df['month'].astype(str).zfill(2)
df = df.loc[(df.yyyymm != '199007') & (df.yyyymm != '199008') & (df.yyyymm != '199101')]
Upvotes: 2
Reputation: 820
You can use apply
function
def filter(row):
# Add other conditions and just return row accordingly
if ((row.year != 1990) | (row.month != 7)):
return True
return False
mask = df.apply(filter,axis=1)
df[mask]
Upvotes: 1
Reputation: 61930
You could do:
mask = ~df[['year', 'month']].apply(tuple, 1).isin([(1990, 7), (1990, 8), (1991, 1)])
print(df[mask])
Output
year month data1
2 1990 9 2500
3 1990 9 1500
5 1991 2 350
6 1991 3 350
7 1991 7 450
Upvotes: 11