The Great

Reputation: 7713

Use regex to remove/exclude columns from dataframe - Python

I have a dataframe which can be generated from the code below

    df = pd.DataFrame({'person_id': [1,2,3],
                       'date1': ['12/31/2007','11/25/2009','10/06/2005'], 'date1derived': [0,0,0], 'val1': [2,4,6],
                       'date2': ['12/31/2017','11/25/2019','10/06/2015'], 'date2derived': [0,0,0], 'val2': [1,3,5],
                       'date3': ['12/31/2027','11/25/2029','10/06/2025'], 'date3derived': [0,0,0], 'val3': [7,9,11]})

The dataframe looks as shown below:

       person_id       date1  date1derived  val1       date2  date2derived  val2       date3  date3derived  val3
    0          1  12/31/2007             0     2  12/31/2017             0     1  12/31/2027             0     7
    1          2  11/25/2009             0     4  11/25/2019             0     3  11/25/2029             0     9
    2          3  10/06/2005             0     6  10/06/2015             0     5  10/06/2025             0    11

I would like to remove the columns that contain "derived" in their names. I tried different regexes but couldn't get the expected output.

    df = df.filter(regex='[^H\dDerived]+', axis=1)
    df = df.filter(regex='[^Derived]',axis=1)

Can you let me know the right regex to do this?

Upvotes: 1

Views: 2447

Answers (5)

bharatk

Reputation: 4315

In recent versions of pandas, you can use string methods on the index and columns. Here, str.endswith seems like a good fit.

import pandas as pd

df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],
                   'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],
                   'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],
                   'date3derived':[0,0,0],'val3':[7,9,11]})

df = df.loc[:,~df.columns.str.endswith('derived')]

print(df)

Output:

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11
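
Since the goal is to drop columns that contain "derived" anywhere in the name, str.contains is an equivalent one-line alternative (a minimal sketch, assuming the same df as above):

df = df.loc[:, ~df.columns.str.contains('derived')]  # drop any column whose name contains 'derived'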

Upvotes: 1

anky

Reputation: 75080

You can use pd.Index.difference() with df.filter():

df[df.columns.difference(df.filter(like='derived').columns,sort=False)]

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

Upvotes: 1

Andy L.

Reputation: 25239

IIUC, you want to drop the columns that have derived in their names. This should do:

df.drop(df.filter(like='derived').columns, axis=1)

Out[455]:
   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

Upvotes: 2

heemayl

Reputation: 42017

You can use a zero-width negative lookahead to make sure the string derived does not appear anywhere in the column name:

^(?!.*?derived)
  • ^ matches the start of the string
  • (?!.*?derived) is a negative lookahead that ensures derived does not appear anywhere in the string

Your pattern [^Derived] matches any single character that is not one of D/e/r/i/v/e/d, which is not what you want here.
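
A minimal sketch of how this pattern can be used with df.filter, assuming the df from the question:

df = df.filter(regex=r'^(?!.*?derived)', axis=1)  # keep only columns whose names do not contain 'derived'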

Upvotes: 2

iamklaus

Reputation: 3770

df[[c for c in df.columns if 'derived' not in c]]

Output

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

Upvotes: 1
