Reputation: 8145
I have a pandas dataframe with the following column names:
Result1, Test1, Result2, Test2, Result3, Test3, etc...
I want to drop all the columns whose name contains the word "Test". The number of such columns is not static but depends on a previous function.
How can I do that?
Upvotes: 238
Views: 306651
Reputation: 31
I do not recommend using the filter method here, because it returns a copy of the entire dataframe, which is not ideal for larger datasets.
Instead, pandas provides regex filtering of columns using str.match:
df.columns.str.match('.*Test.*')
# array([ True, False, False, False])
(This returns a boolean array flagging 'Test' anywhere in the column names, not just at the start.)
Use .loc to select columns with the boolean array. Note that ~ inverts the array, since we want to drop (not keep) all the columns that contain 'Test':
df = df.loc[:, ~df.columns.str.match('.*Test.*')]
In this way, only the column names are needed for the filtering, and we never need to return a copy of the filtered data. Note that there are other str methods that can be applied to the column names, like startswith and endswith, but match provides the power of regex, so it is the most universal.
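For comparison, here is a minimal sketch (on a hypothetical four-column dataframe whose names match the boolean arrays shown above) of match next to startswith and endswith:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['Test1', 'toto', 'test2', 'riri'])

print(df.columns.str.match('.*Test.*'))   # [ True False False False]
print(df.columns.str.startswith('Test'))  # [ True False False False]
print(df.columns.str.endswith('2'))       # [False False  True False]

# Keep only the columns that do NOT contain 'Test'
print(df.loc[:, ~df.columns.str.match('.*Test.*')])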
Upvotes: 3
Reputation: 780
Building on my preferred answer by @cs95, combining loc with a lambda function enables a nice, clean pipe chain like this:
output_df = (
input_df
.stuff
.more_stuff
.yet_more_stuff
.loc[:, lambda x: ~x.columns.str.startswith('Test')]
)
This way you can refer to the columns of the dataframe produced by pd.DataFrame.yet_more_stuff, rather than the original dataframe input_df itself, as the columns may have changed (depending, of course, on all the stuff).
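As a concrete sketch of the same pattern, with the placeholder methods swapped for real (hypothetical) pipe steps:
import pandas as pd

input_df = pd.DataFrame({'Result1': [1, 2], 'Test1': [3, 4]})

# The lambda receives the dataframe as it exists at that point in the
# chain, so a column created earlier in the pipe ('Test2') is seen too.
output_df = (
    input_df
    .assign(Test2=lambda x: x['Test1'] * 2)
    .loc[:, lambda x: ~x.columns.str.startswith('Test')]
)
print(output_df)  # only Result1 survives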
Upvotes: 0
Reputation: 49
You can use df.filter to get the list of columns that match your string, and then use df.drop:
resdf = df.drop(df.filter(like='Test', axis=1).columns.to_list(), axis=1)
Upvotes: 4
Reputation: 171
Using a regex to match all columns not containing the unwanted word:
df = df.filter(regex='^((?!badword).)*$')
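The pattern works via a tempered negative lookahead: before consuming each character, (?!badword) checks that the unwanted word does not start at that position, so a name matches the whole pattern only if the word appears nowhere in it. A minimal sketch with 'Test' as the unwanted word:
import pandas as pd

df = pd.DataFrame(columns=['Result1', 'Test1', 'Result2', 'Test2'])

# Keep only columns in which 'Test' appears nowhere.
print(df.filter(regex='^((?!Test).)*$').columns.tolist())
# ['Result1', 'Result2']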
Upvotes: 13
Reputation: 79
The question states 'I want to drop all the columns whose name contains the word "Test"':
test_columns = [col for col in df if 'Test' in col]
df.drop(columns=test_columns, inplace=True)
Upvotes: 8
Reputation: 3491
Here is one way to do this:
df = df[df.columns.drop(list(df.filter(regex='Test')))]
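To unpack the one-liner, a step-by-step sketch (assuming the question's Result1/Test1/Result2/Test2 columns):
import pandas as pd

df = pd.DataFrame(columns=['Result1', 'Test1', 'Result2', 'Test2'])

matched = df.filter(regex='Test')      # sub-dataframe holding only the 'Test' columns
keep = df.columns.drop(list(matched))  # all column names minus the matched ones
print(list(matched))                   # ['Test1', 'Test2']
print(df[keep].columns.tolist())       # ['Result1', 'Result2']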
Upvotes: 331
Reputation: 8931
A solution for dropping a list of column names that may contain regex. I prefer this approach because I'm frequently editing the drop list. It builds a negative (keep-everything-else) filter regex from the drop list.
import re

drop_column_names = ['A', 'B.+', 'C.*']
# Keep only columns whose full name does not match any pattern in the drop list.
drop_columns_regex = '^(?!(?:' + '|'.join(drop_column_names) + ')$)'
print('Dropping columns:', ', '.join([c for c in df.columns if not re.search(drop_columns_regex, c)]))
df = df.filter(regex=drop_columns_regex, axis=1)
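A quick check of the behavior on a hypothetical dataframe:
import pandas as pd

df = pd.DataFrame(columns=['A', 'B1', 'CX', 'Result'])

# 'A' matches 'A', 'B1' matches 'B.+', 'CX' matches 'C.*'; only 'Result' survives.
print(df.filter(regex='^(?!(?:A|B.+|C.*)$)', axis=1).columns.tolist())
# ['Result']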
Upvotes: 1
Reputation: 518
This method does everything in place, whereas many of the other answers return a modified copy of the dataframe:
df.drop(df.columns[df.columns.str.contains('Test')], axis=1, inplace=True)
Upvotes: 9
Reputation: 402413
In recent versions of pandas, you can use string methods on the index and columns. Here, str.startswith seems like a good fit.
To remove all columns starting with a given substring:
df.columns.str.startswith('Test')
# array([ True, False, False, False])
df.loc[:, ~df.columns.str.startswith('Test')]
toto test2 riri
0 x x x
1 x x x
For case-insensitive matching, you can use regex-based matching with str.contains and a start-of-line (^) anchor:
df.columns.str.contains('^test', case=False)
# array([ True, False, True, False])
df.loc[:, ~df.columns.str.contains('^test', case=False)]
toto riri
0 x x
1 x x
If mixed types are a possibility, specify na=False as well, so that non-string labels count as non-matches instead of NaN.
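A minimal sketch of why na=False matters when some column labels are not strings (hypothetical dataframe):
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['Test1', 0, 'riri'])

# Without na=False the non-string label 0 yields NaN in the result, which
# breaks the boolean inversion; na=False counts it as 'no match'.
mask = df.columns.str.contains('^test', case=False, na=False)
print(df.loc[:, ~mask])  # keeps the columns 0 and riri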
Upvotes: 137
Reputation: 728
This can be done neatly in one line with:
df = df.drop(df.filter(regex='Test').columns, axis=1)
Upvotes: 48
Reputation: 269
You can select just the columns you DO want using 'filter':
import pandas as pd
import numpy as np
data2 = [{'test2': 1, 'result1': 2}, {'test': 5, 'result34': 10, 'c': 20}]
df = pd.DataFrame(data2)
df
c result1 result34 test test2
0 NaN 2.0 NaN NaN 1.0
1 20.0 NaN 10.0 5.0 NaN
Now filter:
df.filter(like='result', axis=1)
to get:
result1 result34
0 2.0 NaN
1 NaN 10.0
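Since filter keeps the matches, dropping by the same criterion means taking the complement of the matched columns, for example:
# Complement of the filtered columns; note that difference() sorts the names.
kept = df.columns.difference(df.filter(like='test', axis=1).columns)
df[kept]  # leaves c, result1, result34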
Upvotes: 22
Reputation: 25662
Use the DataFrame.select method (note that select was deprecated in pandas 0.21 and removed in 1.0, so this only works on older versions):
In [37]: import re; from numpy.random import randn; from pandas import DataFrame
In [38]: df = DataFrame({'Test1': randn(10), 'Test2': randn(10), 'awesome': randn(10)})
In [39]: df.select(lambda x: not re.search(r'Test\d+', x), axis=1)
Out[39]:
awesome
0 1.215
1 1.247
2 0.142
3 0.169
4 0.137
5 -0.971
6 0.736
7 0.214
8 0.111
9 -0.214
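On current pandas, an equivalent sketch using loc with a list comprehension in place of the removed select:
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({'Test1': np.random.randn(10),
                   'Test2': np.random.randn(10),
                   'awesome': np.random.randn(10)})

# Same predicate as the select() call above, applied to the column labels.
df.loc[:, [c for c in df.columns if not re.search(r'Test\d+', c)]]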
Upvotes: 11
Reputation: 3507
import pandas as pd
import numpy as np
array = np.random.random((2, 4))
df = pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))
print(df)
Test1 toto test2 riri
0 0.923249 0.572528 0.845464 0.144891
1 0.020438 0.332540 0.144455 0.741412
cols = [c for c in df.columns if c.lower()[:4] != 'test']
df = df[cols]
print(df)
toto riri
0 0.572528 0.144891
1 0.332540 0.741412
Upvotes: 121