MorrisseyJ

Reputation: 1271

Find index for column name matching regex in pandas

I am trying to identify the index of a column name based on a matching regex condition. However, df.columns returns an object of type pandas.core.indexes.base.Index, which does not expose positional indices directly. I want the position so that I can slice the df to get rid of the columns that I don't need.

Here is a worked example:

import pandas as pd

# create a df with column names
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

df.columns
> Index(['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'], dtype='object')

I want to get rid of all the metadata columns.

On a Series I would try something like:

df.columns[df.columns.str.contains('meta')].index[0]
> ValueError: Cannot mask with non-boolean array containing NA / NaN values

So I try converting with .astype('str'):

df.columns.astype('str')[df.columns.astype('str').str.contains('meta')].index[0]
> AttributeError: 'Index' object has no attribute 'index'

So my Index has no .index method, and I am left converting to a list and enumerating with a regex condition:

import re

[i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)]
> [4, 5]

This works so I can do the following:

cutoff = [i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)][0]
df = df.iloc[:,:cutoff]

This, however, seems extraordinarily convoluted for such a menial task. In R this would be as simple as:

cutoff <- min(grep('meta', colnames(df))) - 1 #-1 to address non-zero indexing
df <- df[, seq(1, cutoff)]

Is there no easier way to do this in pandas, other than to 1) convert to string, 2) convert to list, and 3) enumerate the list? Essentially I would have thought there was an equivalent of the min(grep('meta', colnames(df))) - 1 line.
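For reference, the whole operation can be condensed with numpy (a sketch, not from the original post): np.flatnonzero turns the boolean mask into integer positions, playing the role of R's grep, so the first match gives the cutoff directly.

```python
import numpy as np
import pandas as pd

# Same frame as in the question: mixed string and integer column labels.
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# astype(str) makes str.contains safe on the integer labels;
# flatnonzero converts the boolean mask to integer positions.
mask = df.columns.astype(str).str.contains('meta')
cutoff = np.flatnonzero(mask)[0]   # first metadata column -> 4

df = df.iloc[:, :cutoff]
print(list(df.columns))  # ['Country', 2010, 2011, 2012]
```

This keeps the single-expression flavour of the R version without an explicit list conversion or enumerate loop.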

Upvotes: 1

Views: 1812

Answers (2)

Chiheb Nexus

Reputation: 9267

Here is a solution that may fulfill your needs:

import pandas as pd

df = pd.DataFrame(columns = ['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

df_ = df.columns.to_frame(index=False, name='index')
matched = df_.loc[
   df_['index'].str.contains(r'metadata\d+|metadat\d+', na=False)
].index.values

print(matched)

Output:

array([4, 5])

You could also use the get_indexer method to get the index positions:

df.columns.get_indexer(df.columns[df.columns.str.contains('meta', na=False)])
array([4, 5])
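Building on that, the first position returned by get_indexer can serve as the cutoff for the slice the question describes (a sketch, not part of the original answer):

```python
import pandas as pd

# The question's frame: mixed string and integer column labels.
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# Integer labels produce NaN in str.contains, which na=False masks out,
# so get_indexer receives only the matching (string) labels.
positions = df.columns.get_indexer(
    df.columns[df.columns.str.contains('meta', na=False)]
)

cutoff = positions.min()          # first metadata column -> 4
trimmed = df.iloc[:, :cutoff]     # keep everything before it
print(list(trimmed.columns))      # ['Country', 2010, 2011, 2012]
```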

If you just want to filter out the meta columns, you can use boolean indexing in loc:

df.loc[:, ~df.columns.str.contains('meta', na=False)]

Empty DataFrame
Columns: [Country, 2010, 2011, 2012]
Index: []

Upvotes: 1

user15398259

Reputation:

You can combine .drop() and .filter():

>>> df.filter(like='meta')
  metadata1 metadat2
0         e        f
1         k        l
>>> df.drop(columns=df.filter(like='meta'))
  Country 2010 2011 2012
0       a    b    c    d
1       g    h    i    j

You can also use regex= to select all columns whose names do not contain meta:

>>> df.filter(regex='^(?:(?!meta).)+$')
  Country 2010 2011 2012
0       a    b    c    d
1       g    h    i    j
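An equivalent of the filter-plus-drop route is to hand drop the matched columns' Index explicitly (a sketch using the question's empty frame, rather than the populated one shown in this answer):

```python
import pandas as pd

# The question's frame, with no rows.
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# filter(like='meta') selects the metadata columns; passing their
# .columns Index to drop removes them from the original frame.
out = df.drop(columns=df.filter(like='meta').columns)
print(list(out.columns))  # ['Country', 2010, 2011, 2012]
```

Passing .columns rather than the filtered DataFrame itself makes it explicit that drop is receiving labels.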

Upvotes: 3
