Reputation: 177
I am trying to build a pandas dataframe by filtering on the column names. The dataframe's columns look like:
NAME_1, EMAIL_1, PHONO_1, POSTALCODE_1, NAME_2, EMAIL_2, PHONO_2, POSTALCODE_2, etc.
I understand that to keep only the "email" columns, I just have to put:
df_temp = df.filter(like='EMAIL')
Where the result looks like:
EMAIL_1, EMAIL_2, EMAIL_n
So far so good, but I also want to include other columns in the filter and keep the sequence that links the data together. I am looking for output like:
NAME_1, EMAIL_1, NAME_2, EMAIL_2
I tried:
df_temp = df[df.filter(like='NAME') & df.filter(like='EMAIL')]
But it doesn't work.
The following does not return the columns in the order I need:
df_temp_1 = df.filter(like='NAME')
df_temp_2 = df.filter(like='EMAIL')
df_temp_3 = pd.concat([df_temp_1, df_temp_2], axis=1)
This gives:
NAME_1, NAME_2, EMAIL_1, EMAIL_2, etc.
I want to see it as:
NAME_1, EMAIL_1, NAME_2, EMAIL_2
I would even like to be able to separate them into different dataframes. Can you recommend a method?
Thank you very much in advance.
Regards
Upvotes: 1
Views: 169
Reputation: 28709
One option is select_columns from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
In [106]: columns = ['NAME_1', 'EMAIL_1', 'PHONO_1', 'POSTALCODE_1', 'NAME_2','EMAIL_2', 'PHONO_2', 'POSTALCODE_2']
In [107]: df = pd.DataFrame([], columns = columns)
In [108]: df
Out[108]:
Empty DataFrame
Columns: [NAME_1, EMAIL_1, PHONO_1, POSTALCODE_1, NAME_2, EMAIL_2, PHONO_2, POSTALCODE_2]
Index: []
In [109]: df.select_columns('NAME*', 'EMAIL*')
Out[109]:
Empty DataFrame
Columns: [NAME_1, NAME_2, EMAIL_1, EMAIL_2]
Index: []
The * idea is based on Unix shell globbing for filtering.
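As a quick hedged usage example (the row of values below is invented, and pyjanitor must be installed), the same call on a populated frame:
# Sketch with one made-up row, just to show the selection on non-empty data.
import janitor  # registers .select_columns on DataFrame
import pandas as pd

df = pd.DataFrame(
    [['Ann', 'ann@example.com', '111', '10001',
      'Bob', 'bob@example.com', '222', '10002']],
    columns=['NAME_1', 'EMAIL_1', 'PHONO_1', 'POSTALCODE_1',
             'NAME_2', 'EMAIL_2', 'PHONO_2', 'POSTALCODE_2'],
)
print(df.select_columns('NAME*', 'EMAIL*'))
# Columns come back grouped per pattern: NAME_1, NAME_2, EMAIL_1, EMAIL_2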
Upvotes: 1
Reputation:
Using a regular expression is the best solution, but you can also extract the columns from the two filter calls and join them together with union:
cols = df.filter(like='NAME').columns.union(df.filter(like='EMAIL').columns)
Output:
>>> cols
Index(['EMAIL_1', 'EMAIL_2', 'NAME_1', 'NAME_2'], dtype='object')
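A small follow-up sketch (variable names as above): index the original frame with cols; since union sorts the labels alphabetically, walk df.columns if you want to keep the original layout.
# Sketch: select the filtered columns; union() returns a sorted Index,
# so iterate over df.columns to restore the original column order.
wanted = set(cols)
df_temp = df[[c for c in df.columns if c in wanted]]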
Upvotes: 1
Reputation: 323316
Check with str.startswith:
out = df.loc[:,df.columns.str.startswith(('NAME','EMAIL'))]
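If you also want the separate dataframes mentioned in the question, one possible sketch (assuming every column ends in _<number>) is to split out by that suffix:
# Sketch: one DataFrame per numeric suffix (NAME_1/EMAIL_1, NAME_2/EMAIL_2, ...).
suffixes = sorted({c.rsplit('_', 1)[-1] for c in out.columns}, key=int)
frames = {s: out[[c for c in out.columns if c.rsplit('_', 1)[-1] == s]]
          for s in suffixes}
# frames['1'] -> columns NAME_1, EMAIL_1; frames['2'] -> NAME_2, EMAIL_2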
Upvotes: 1
Reputation: 30050
You can try
out = df.filter(regex='(NAME|EMAIL)_[12]')
If the suffixes are different:
out = df.filter(regex='(NAME_[12]|EMAIL_[3-5])')
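A hedged variation: if the prefixes live in a list, you can assemble the pattern instead of hand-writing it (this assumes the suffix is always _<digits>):
# Sketch: build the regex from a list of prefixes.
prefixes = ['NAME', 'EMAIL']
out = df.filter(regex=rf"^({'|'.join(prefixes)})_\d+$")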
Upvotes: 1
Reputation: 120469
Use regex instead of like:
out = df.filter(regex='^(NAME|EMAIL)')
print(out)
# Output (sample)
NAME_1 EMAIL_1 NAME_2 EMAIL_2
0 5 9 5 9
1 8 2 3 9
2 8 8 1 5
3 6 7 9 5
4 6 6 4 3
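In case the source frame is not already laid out person by person, a possible follow-up sketch is to reorder the filtered columns by their numeric suffix, keeping NAME before EMAIL within each suffix:
# Sketch: force the NAME_1, EMAIL_1, NAME_2, EMAIL_2, ... order explicitly.
order = ['NAME', 'EMAIL']
def sort_key(col):
    prefix, suffix = col.rsplit('_', 1)
    return int(suffix), order.index(prefix)
out = out[sorted(out.columns, key=sort_key)]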
Upvotes: 4