teteh May
teteh May

Reputation: 455

Pandas Data Frame Filtering Multiple Conditions

I have the following data frame

df = pd.DataFrame([[1990,7,1000],[1990,8,2500],[1990,9,2500],[1990,9,1500],[1991,1,250],[1991,2,350],[1991,3,350],[1991,7,450]], columns = ['year','month','data1'])

year    month    data1
1990      7      1000
1990      8      2500
1990      9      2500
1990      9      1500
1991      1      250
1991      2      350
1991      3      350
1991      7      450

I would like to filter the data such that it won't contain data with month/year 07/1990, 08/1990 and 01/1991. I can do for each combination month/year as follow:

df = df.loc[(df.year != 1990) | (df.month != 7)]

But it is not efficient if there are many combinations month/year. Is there any more efficient way of doing this?

Many thanks.

Upvotes: 11

Views: 2247

Answers (7)

sammywemmy
sammywemmy

Reputation: 28729

Set the year and month as indices, filter on the indices with isin and a callable and reset the index:

(
    df.set_index(["year", "month"])
    .loc[lambda x: ~x.index.isin([(1990, 7), (1990, 8), (1991, 1)])]
    .reset_index()
)


year    month   data1
0   1990    9   2500
1   1990    9   1500
2   1991    2   350
3   1991    3   350
4   1991    7   450

Alternatively, you can use the drop function to drop the rows you are not interested in, and reset the index:

(df.set_index(["year", "month"], drop = False)
 .drop([(1990, 7), (1990, 8), (1991, 1)])
 .reset_index(drop = True)
  )

Upvotes: 2

ljuk
ljuk

Reputation: 784

[EDIT] just iterate over your df and get all indexes to delete:

dates_to_avoid = [(1990, 7), (1990, 8), (1991, 1)])
index_to_delete = [row.index for row in df.itertuples() if (row.year, row.month) in dates_to_avoid]
   

Then:

df = df.loc[~df.index.isin(index_to_delete)]

Here the keyword ~ will avoid all values from the list.

Upvotes: 2

Pierre D
Pierre D

Reputation: 26311

Even faster (roughly 3x than the elegant version of @DaniMesejo applying tuple). But also it relies on the knowledge that months are bounded to (well below) 100, so less generalizable:

mask = ~(df.year*100 + df.month).isin({199007, 199008, 199101})
df[mask]

# out:
   year  month  data1
2  1990      9   2500
3  1990      9   1500
5  1991      2    350
6  1991      3    350
7  1991      7    450

How come this is 3x faster than the tuples solution? (Tricks for speed):

  1. All vectorized operations and no apply.
  2. No string operations, all ints.
  3. Using .isin() with a set as argument (not a list).

Upvotes: 10

BENY
BENY

Reputation: 323396

Let us try merge

out = df.drop(df.reset_index().merge(pd.DataFrame({'year':[1990,1990,1991],'month':[7,8,1]}))['index'])
   year  month  data1
2  1990      9   2500
3  1990      9   1500
5  1991      2    350
6  1991      3    350
7  1991      7    450

And small improvement

out = df.merge(pd.DataFrame({'year':[1990,1990,1991],'month':[7,8,1]}),indicator=True,how='left').loc[lambda x : x['_merge']=='left_only']
   year  month  data1     _merge
2  1990      9   2500  left_only
3  1990      9   1500  left_only
5  1991      2    350  left_only
6  1991      3    350  left_only
7  1991      7    450  left_only

Based on my test this should be fast than apply tuple method ~

Upvotes: 3

Paul Brennan
Paul Brennan

Reputation: 2706

You can add a value for yyyymm and then use this to remove the data you want.

df['yyyymm'] = df['year'].astype(str) + df['month'].astype(str).zfill(2)
df = df.loc[(df.yyyymm != '199007') & (df.yyyymm != '199008') & (df.yyyymm != '199101')]

Upvotes: 2

k33da_the_bug
k33da_the_bug

Reputation: 820

You can use apply function

def filter(row):
  # Add other conditions and just return row accordingly
  if ((row.year != 1990) | (row.month != 7)):
    return True
  return False

mask = df.apply(filter,axis=1)
df[mask]

Upvotes: 1

Dani Mesejo
Dani Mesejo

Reputation: 61930

You could do:

mask = ~df[['year', 'month']].apply(tuple, 1).isin([(1990, 7), (1990, 8), (1991, 1)])
print(df[mask])

Output

   year  month  data1
2  1990      9   2500
3  1990      9   1500
5  1991      2    350
6  1991      3    350
7  1991      7    450

Upvotes: 11

Related Questions