Alexis Eggermont

Reputation: 8145

Drop columns whose name contains a specific string from pandas DataFrame

I have a pandas dataframe with the following column names:

Result1, Test1, Result2, Test2, Result3, Test3, etc...

I want to drop all the columns whose name contains the word "Test". The number of such columns is not fixed; it depends on the result of a previous function.

How can I do that?

Upvotes: 238

Views: 306651

Answers (13)

jetpeach

Reputation: 31

I do not recommend the filter method for this, because it returns an entire new dataframe, which is not ideal for larger datasets.

Instead, pandas supports regex matching against the column names with str.match:

df.columns.str.match('.*Test.*')
# array([ True, False, False, False])

(this returns a boolean array that is True wherever 'Test' appears anywhere in the column name, not just at the start)

Use .loc with the boolean array to select columns. Note that ~ inverts the array, since we want to drop (not keep) the columns that contain 'Test':

df = df.loc[:, ~df.columns.str.match('.*Test.*')]

This way, only the column names are needed for the filtering, and no intermediate copy of the filtered data is returned. Note that other str methods can be applied to the column names, such as startswith and endswith, but match provides the full power of regex, so it is the most general.
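
For comparison, a quick sketch of those other str methods on the same (assumed) column names:

df.columns.str.startswith('Test')   # True only when the name begins with 'Test'
df.columns.str.endswith('Test')     # True only when the name ends with 'Test'
df.columns.str.contains('Test')     # True when 'Test' appears anywhere, no regex needed

For this question, str.contains('Test') is the simplest equivalent of the match('.*Test.*') pattern above.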

Upvotes: 3

tef2128

Reputation: 780

Building on my preferred answer from @cs95, combining .loc with a lambda function enables a clean pipe chain like this:

output_df = (
    input_df
    .stuff            # placeholder steps standing in for your own chained operations
    .more_stuff
    .yet_more_stuff
    .loc[:, lambda x: ~x.columns.str.startswith('Test')]
)

This way you can refer to the columns of the dataframe produced by yet_more_stuff, rather than those of the original input_df, since the columns may have changed along the way (depending, of course, on all the stuff).
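
A runnable sketch of the same pattern, with the placeholder steps swapped for real operations (the column names and intermediate steps here are my own assumptions for illustration):

import pandas as pd

input_df = pd.DataFrame({'Result1': [1, 2], 'Test1': [3, 4], 'Result2': [5, 6]})

output_df = (
    input_df
    .rename(columns=str.strip)                             # some intermediate step
    .assign(Test2=lambda x: x['Test1'] * 2)                # a step that adds a new 'Test' column
    .loc[:, lambda x: ~x.columns.str.startswith('Test')]   # drops Test1 and Test2 at the end
)
# output_df keeps only Result1 and Result2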

Upvotes: 0

ZacNt

Reputation: 49

You can use df.filter to get the columns that match your string, and then pass their names to df.drop:

resdf = df.drop(df.filter(like='Test', axis=1).columns.to_list(), axis=1)

Upvotes: 4

Roy Assis

Reputation: 171

Use a regex to match all columns not containing the unwanted word:

df = df.filter(regex='^((?!badword).)*$')
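
Applied to this question's columns (a small sketch; the example dataframe is assumed):

import pandas as pd

df = pd.DataFrame(columns=['Result1', 'Test1', 'Result2', 'Test2'])
df = df.filter(regex='^((?!Test).)*$')   # keeps Result1 and Result2 only

The negative lookahead (?!Test) is checked before every character, so the pattern only matches names in which 'Test' never appears.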

Upvotes: 13

Marvasti

Reputation: 79

The question states 'I want to drop all the columns whose name contains the word "Test"', so a plain substring check on each column name is enough:

test_columns = [col for col in df if 'Test' in col]
df.drop(columns=test_columns, inplace=True)

Upvotes: 8

Bindiya12

Reputation: 3491

Here is one way to do this:

df = df[df.columns.drop(list(df.filter(regex='Test')))]
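
Reading it from the inside out: df.filter(regex='Test') selects the matching columns, df.columns.drop removes those names from the full column Index, and the outer df[...] selects what remains. A sketch of the intermediate values (the example columns are assumed):

list(df.filter(regex='Test'))                      # ['Test1', 'Test2', 'Test3']
df.columns.drop(['Test1', 'Test2', 'Test3'])
# Index(['Result1', 'Result2', 'Result3'], dtype='object')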

Upvotes: 331

BSalita

Reputation: 8931

A solution for dropping a list of column names that may contain regexes. I prefer this approach because I'm frequently editing the drop list. It converts the drop list into a negative filter regex:

import re

drop_column_names = ['A', 'B.+', 'C.*']
# Negative lookahead: the regex matches every column name NOT covered by the drop list
drop_columns_regex = '^(?!(?:' + '|'.join(drop_column_names) + ')$)'
print('Dropping columns:', ', '.join([c for c in df.columns if not re.search(drop_columns_regex, c)]))
df = df.filter(regex=drop_columns_regex, axis=1)

Upvotes: 1

winderland

Reputation: 518

This method does everything in place. Many of the other answers create copies and are not as efficient:

df.drop(df.columns[df.columns.str.contains('Test')], axis=1, inplace=True)

Upvotes: 9

cs95

Reputation: 402413

Cheaper, Faster, and Idiomatic: str.contains

In recent versions of pandas, you can use string methods on the index and columns. Here, str.startswith seems like a good fit.
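
The examples below assume a small dataframe like this one (the setup is mine, reconstructed to match the outputs shown):

import pandas as pd

df = pd.DataFrame([['x'] * 4] * 2, columns=['Test1', 'toto', 'test2', 'riri'])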

To remove all columns starting with a given substring:

df.columns.str.startswith('Test')
# array([ True, False, False, False])

df.loc[:,~df.columns.str.startswith('Test')]

  toto test2 riri
0    x     x    x
1    x     x    x

For case-insensitive matching, use regex-based matching with str.contains and a start-of-string anchor (^):

df.columns.str.contains('^test', case=False)
# array([ True, False,  True, False])

df.loc[:,~df.columns.str.contains('^test', case=False)] 

  toto riri
0    x    x
1    x    x

If mixed types are a possibility, specify na=False as well.
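
For example, the same selection with na=False, so non-string column labels simply evaluate to False rather than NaN:

df.loc[:, ~df.columns.str.contains('^test', case=False, na=False)]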

Upvotes: 137

Warren O'Neill

Reputation: 728

This can be done neatly in one line with:

df = df.drop(df.filter(regex='Test').columns, axis=1)

Upvotes: 48

SAH

Reputation: 269

You can select just the columns you DO want using filter:

import pandas as pd

data2 = [{'test2': 1, 'result1': 2}, {'test': 5, 'result34': 10, 'c': 20}]
df = pd.DataFrame(data2)
df

      c  result1  result34  test  test2
0   NaN      2.0       NaN   NaN    1.0
1  20.0      NaN      10.0   5.0    NaN

Now filter:

df.filter(like='result', axis=1)

to get:

   result1  result34
0      2.0       NaN
1      NaN      10.0

Upvotes: 22

Phillip Cloud

Reputation: 25662

Use the DataFrame.select method (note: select was deprecated in pandas 0.21 and removed in 1.0, so this only works on older versions):

In [37]: from pandas import DataFrame; from numpy.random import randn; import re

In [38]: df = DataFrame({'Test1': randn(10), 'Test2': randn(10), 'awesome': randn(10)})

In [39]: df.select(lambda x: not re.search(r'Test\d+', x), axis=1)
Out[39]:
   awesome
0    1.215
1    1.247
2    0.142
3    0.169
4    0.137
5   -0.971
6    0.736
7    0.214
8    0.111
9   -0.214
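
On pandas 1.0 and later, where select is gone, an equivalent using loc (my translation, not part of the original answer):

df.loc[:, [not re.search(r'Test\d+', c) for c in df.columns]]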

Upvotes: 11

Nic

Reputation: 3507

import pandas as pd
import numpy as np

array = np.random.random((2, 4))
df = pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))
print(df)

      Test1      toto     test2      riri
0  0.923249  0.572528  0.845464  0.144891
1  0.020438  0.332540  0.144455  0.741412

cols = [c for c in df.columns if c.lower()[:4] != 'test']
df = df[cols]
print(df)
       toto      riri
0  0.572528  0.144891
1  0.332540  0.741412

Upvotes: 121
