Reputation: 125
I have a dataframe with columns that follow certain naming convention. I want to keep only those that have 'out' and 'agg' as prefixes in the header.
I've drafted the following code to achieve this. I created a list so that I can make this a small function and call it for any combination of col prefixes that I want to extract.
prefix = ['out', 'agg']
cols = []
for pref in prefix:
cols = cols + [col for col in df.columns if pref in col]
df = df[cols].dropna(how='all', axis=0)
Is there a shorter/faster way to do this? I liked the solutions here:Drop columns whose name contains a specific string from pandas DataFrame but couldn't make them work for a list of strings.
thanks
Upvotes: 1
Views: 1320
Reputation: 862481
Use DataFrame.filter
with regex for match columns names by strings joined by |
for regex or
:
df = pd.DataFrame({
'A_out':list('abcdef'),
'B_out':[4,5,4,5,5,4],
'C_agg':[7,8,9,4,2,3],
'agg_D':[1,3,5,7,1,0],
'out_E':[5,3,6,9,2,4],
'F_agg':list('aaabbb')
})
prefix = ['out', 'agg']
If need match values for any positions in columns names:
df0 = df.filter(regex='|'.join(prefix)).dropna(how='all')
print (df0)
A_out B_out C_agg agg_D out_E F_agg
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
If need only suffixes add $
for match end of strings:
df1 = df.filter(regex='|'.join(f'{x}$' for x in prefix)).dropna(how='all')
print (df1)
A_out B_out C_agg F_agg
0 a 4 7 a
1 b 5 8 a
2 c 4 9 a
3 d 5 4 b
4 e 5 2 b
5 f 4 3 b
If need only prefixes add ^
for match start of strings:
df2 = df.filter(regex='|'.join(f'^{x}' for x in prefix)).dropna(how='all')
print (df2)
agg_D out_E
0 1 5
1 3 3
2 5 6
3 7 9
4 1 2
5 0 4
Upvotes: 2