Reputation: 1271
I am trying to identify the position of a column whose name matches a regex condition. However, df.columns
produces an object of type pandas.core.indexes.base.Index
, which does not expose positional indexes. I want the position so that I can slice the df to get rid of the columns that I don't need.
Here is a worked example:
#create a df with column names
df = pd.DataFrame(columns = ['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])
df.columns
> Index(['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'], dtype='object')
I want to get rid of all the metadata columns.
On a Series I would try something like:
df.columns[df.columns.str.contains('meta')].index[0]
> ValueError: Cannot mask with non-boolean array containing NA / NaN values
So I try converting with .astype('str'):
df.columns.astype('str')[df.columns.astype('str').str.contains('meta')].index[0]
> AttributeError: 'Index' object has no attribute 'index'
So my Index has no .index
method. I am left to convert to a list and enumerate with an re
condition:
[i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)]
> [4, 5]
This works, so I can do the following:
cutoff = [i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)][0]
df = df.iloc[:,:cutoff]
This however seems extraordinarily laborious for such a menial task. In R this would be as simple as:
cutoff <- min(grep('meta', colnames(df))) - 1 #-1 to address non-zero indexing
df <- df[, seq(1, cutoff)]
Is there no easier way to do this in pandas, other than to 1) convert to string, 2) convert to list, 3) enumerate the list? Essentially I would have thought there was an equivalent of the min(grep('meta', colnames(df))) - 1
line.
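For completeness, here is the whole workaround above as one self-contained snippet, using the same toy frame:

```python
import re

import pandas as pd

# toy frame with a mix of string and integer column labels
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# positions of columns whose (stringified) name matches 'meta'
positions = [i for i, item in enumerate(df.columns.astype('str'))
             if re.findall('meta', item)]
print(positions)  # [4, 5]

# slice everything before the first match
cutoff = positions[0]
df = df.iloc[:, :cutoff]
print(list(df.columns))  # ['Country', 2010, 2011, 2012]
```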
Upvotes: 1
Views: 1812
Reputation: 9267
Here is a solution that may fulfill your needs:
import pandas as pd
df = pd.DataFrame(columns = ['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])
# turn the column Index into a one-column DataFrame so .loc and .index work
df_ = df.columns.to_frame(index=False, name='index')
# na=False guards against the integer labels, which .str.contains maps to NaN
matched = df_.loc[
    df_['index'].str.contains(r'metadata\d+|metadat\d+', na=False)
].index.values
print(matched)
Output:
array([4, 5])
You could also use the get_indexer method to get the index positions:
df.columns.get_indexer(df.columns[df.columns.str.contains('meta', na=False)])
array([4, 5])
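Those positions can feed straight into iloc to get the R-style cutoff the question asks about. A sketch, assuming the same toy frame as the question:

```python
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# positions of the columns matching 'meta'; na=False handles the integer labels
matched = df.columns.get_indexer(
    df.columns[df.columns.str.contains('meta', na=False)]
)
print(matched)  # [4 5]

# slice up to the first match: the pandas analogue of min(grep(...)) - 1 in R
df = df.iloc[:, :matched.min()]
print(list(df.columns))  # ['Country', 2010, 2011, 2012]
```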
If you just want to filter out the meta
columns, you can use boolean indexing with loc
:
df.loc[:, ~df.columns.str.contains('meta', na=False)]
Empty DataFrame
Columns: [Country, 2010, 2011, 2012]
Index: []
Upvotes: 1
Reputation:
You can combine .drop()
and .filter():
>>> df.filter(like='meta')
metadata1 metadat2
0 e f
1 k l
>>> df.drop(columns=df.filter(like='meta'))
Country 2010 2011 2012
0 a b c d
1 g h i j
You can also use regex=
to find all columns without meta:
>>> df.filter(regex='^(?:(?!meta).)+$')
Country 2010 2011 2012
0 a b c d
1 g h i j
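If integer positions are still wanted, the labels that filter() matches can be mapped to positions with get_indexer. A sketch, assuming an empty frame with the question's column labels (the data rows above are not needed for the column logic):

```python
import pandas as pd

df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# labels matched by filter(), then their integer positions
meta_cols = df.filter(like='meta').columns
positions = df.columns.get_indexer(meta_cols)
print(positions)  # [4 5]

# drop by label, no positions needed at all
trimmed = df.drop(columns=meta_cols)
print(list(trimmed.columns))  # ['Country', 2010, 2011, 2012]
```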
Upvotes: 3