Reputation: 5347
I have received a pandas DataFrame. It is full of unnecessary features that I would like to remove. Right now I am doing the following, which is dirty. How could I do this in a more Pythonic way?
features_to_include = mydf.columns.tolist()
features_to_include = [f for f in features_to_include if 'stopword1' not in f]
features_to_include = [f for f in features_to_include if 'stopwordN' not in f]
[... other 90 of those]
features_to_include = [f for f in features_to_include if 'password1' in f]
features_to_include = [f for f in features_to_include if 'passwordN' in f]
[... other 90 of those]
EDIT: 'stopword1' and 'password1' are not themselves in mydf.columns; they only appear as parts of column names. An example column name could be: feature99_stopword1
Upvotes: 1
Views: 103
Reputation: 13255
You can try using DataFrame.filter, which keeps the columns whose names match the regex:
mydf.filter(regex='password|stopword1', axis=1)
Or if we have a list:
cols = ['password','passwordN','stopword1','stopwordN']
mydf.filter(regex='|'.join(cols), axis=1)
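If any of the terms can contain regex metacharacters such as '.', joining them unescaped would change what matches. Here is a minimal sketch, assuming a small made-up frame and term list (neither comes from the question), that escapes each term with re.escape before joining:

```python
import re
import pandas as pd

# Hypothetical sample frame, for illustration only.
mydf = pd.DataFrame({
    'feature99_stopword1': [1, 2],
    'a.b_total': [3, 4],
    'aXb_total': [5, 6],
    'password1': [7, 8],
})

# Hypothetical substrings to look for; 'a.b' contains a regex metacharacter.
cols = ['password', 'a.b']

# re.escape makes each term match literally before the terms are joined with '|'.
pattern = '|'.join(re.escape(c) for c in cols)

# filter keeps the columns whose names match the pattern, in their original order.
kept = mydf.filter(regex=pattern, axis=1)
print(kept.columns.tolist())
```

Without the escaping, 'a.b' would also match 'aXb_total', because '.' matches any character.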
Upvotes: 1
Reputation: 863226
I think you need str.contains:
L = ['stopword1','stopwordN','password1', 'passwordN']
#thanks roganjosh for suggestion
L = set(['stopword1','stopwordN','password1', 'passwordN'])
mydf = mydf.loc[:, mydf.columns.str.contains('|'.join(L))]
Sample:
import pandas as pd

mydf = pd.DataFrame({'feature99_stopword1': list('abcdef'),
                     'feature99_stopword': [4, 5, 4, 5, 5, 4],
                     'C': [7, 8, 9, 4, 2, 3],
                     'd_stopword1': [1, 3, 5, 7, 1, 0],
                     'password1': [5, 3, 6, 9, 2, 4],
                     'F': list('aaabbb')})
print(mydf)
   feature99_stopword1  feature99_stopword  C  d_stopword1  password1  F
0                    a                   4  7            1          5  a
1                    b                   5  8            3          3  a
2                    c                   4  9            5          6  a
3                    d                   5  4            7          9  b
4                    e                   5  2            1          2  b
5                    f                   4  3            0          4  b
L = ['stopword1','stopwordN','password1', 'passwordN']
mydf = mydf.loc[:, mydf.columns.str.contains('|'.join(L))]
print(mydf)
   feature99_stopword1  d_stopword1  password1
0                    a            1          5
1                    b            3          3
2                    c            5          6
3                    d            7          9
4                    e            1          2
5                    f            0          4
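If the goal is what the question describes (dropping the stopword columns while keeping the password ones), the boolean mask can be negated with ~ and combined with a second mask. This is a minimal sketch, not a definitive solution. The stopwords and passwords lists and the sample frame are hypothetical. It also assumes a column is kept when it contains at least one password term, whereas the chained comprehensions in the question, read literally, require every term:

```python
import re
import pandas as pd

# Hypothetical sample frame, for illustration only.
mydf = pd.DataFrame({
    'feature99_stopword1': list('abc'),
    'feature1_password1': [1, 2, 3],
    'feature2_password1_stopwordN': [4, 5, 6],
    'C': [7, 8, 9],
})

stopwords = ['stopword1', 'stopwordN']  # hypothetical terms to exclude
passwords = ['password1']               # hypothetical terms to require

cols = mydf.columns

# True where a column name contains at least one stopword.
has_stopword = cols.str.contains('|'.join(map(re.escape, stopwords)))

# True where a column name contains at least one password term (assumption).
has_password = cols.str.contains('|'.join(map(re.escape, passwords)))

# Keep columns that have a password term and no stopword.
selected = mydf.loc[:, ~has_stopword & has_password]
print(selected.columns.tolist())
```

Here only feature1_password1 survives: the other password column is dropped because its name also contains a stopword.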
Upvotes: 2