MorrisseyJ

Reputation: 1271

Find index for column name matching regex in pandas

I am trying to identify the index of a column name based on a matching regex condition. However, df.columns returns an object of type pandas.core.indexes.base.Index, which does not expose positional indices directly. I want the position so that I can slice the df to get rid of the columns that I don't need.

Here is a worked example:

import pandas as pd

# create a df with column names
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

df.columns
> Index(['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'], dtype='object')

I want to get rid of all the metadata columns.

On a Series I would try something like:

df.columns[df.columns.str.contains('meta')].index[0]
> ValueError: Cannot mask with non-boolean array containing NA / NaN values

So I try converting with .astype('str'):

df.columns.astype('str')[df.columns.astype('str').str.contains('meta')].index[0]
> AttributeError: 'Index' object has no attribute 'index'

So my Index has no .index method, and I am left converting to a list and enumerating with a regex condition:

import re

[i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)]
> [4, 5]

This works so I can do the following:

cutoff = [i for i, item in enumerate(df.columns.astype('str').to_list()) if re.findall('meta', item)][0]
df = df.iloc[:,:cutoff]

This, however, seems extraordinarily convoluted for such a menial task. In R this would be as simple as:

cutoff <- min(grep('meta', colnames(df))) - 1 #-1 to address non-zero indexing
df <- df[, seq(1, cutoff)]

Is there no easier way to do this in pandas, other than to 1) convert to string, 2) convert to list, and 3) enumerate the list? Essentially I would have thought there was an equivalent of the min(grep('meta', colnames(df))) - 1 line.
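For reference, the whole operation can be condensed with numpy (a sketch, not from the original post): np.flatnonzero turns the boolean mask into integer positions, playing the role of R's grep, so the first match gives the cutoff directly.

```python
import numpy as np
import pandas as pd

# Same frame as in the question: mixed string and integer column labels.
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# astype(str) makes str.contains safe on the integer labels;
# flatnonzero converts the boolean mask to integer positions.
mask = df.columns.astype(str).str.contains('meta')
cutoff = np.flatnonzero(mask)[0]   # first metadata column -> 4

df = df.iloc[:, :cutoff]
print(list(df.columns))  # ['Country', 2010, 2011, 2012]
```

This keeps the single-expression flavour of the R version without an explicit list conversion or enumerate loop.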

Upvotes: 1

Views: 1812

Answers (2)

Chiheb Nexus

Reputation: 9267

Here is a solution that may fulfill your needs:

import pandas as pd

df = pd.DataFrame(columns = ['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

df_ = df.columns.to_frame(index=False, name='index')
matched = df_.loc[
   df_['index'].str.contains(r'metadata\d+|metadat\d+', na=False)
].index.values

print(matched)

Output:

array([4, 5])

You could also use the get_indexer method to get the index positions:

df.columns.get_indexer(df.columns[df.columns.str.contains('meta', na=False)])
array([4, 5])
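Building on that, the first position returned by get_indexer can serve as the cutoff for the slice the question describes (a sketch, not part of the original answer):

```python
import pandas as pd

# The question's frame: mixed string and integer column labels.
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# Integer labels produce NaN in str.contains, which na=False masks out,
# so get_indexer receives only the matching (string) labels.
positions = df.columns.get_indexer(
    df.columns[df.columns.str.contains('meta', na=False)]
)

cutoff = positions.min()          # first metadata column -> 4
trimmed = df.iloc[:, :cutoff]     # keep everything before it
print(list(trimmed.columns))      # ['Country', 2010, 2011, 2012]
```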

If you just want to filter out the meta columns, you can use boolean indexing in loc:

df.loc[:, ~df.columns.str.contains('meta', na=False)]

Empty DataFrame
Columns: [Country, 2010, 2011, 2012]
Index: []

Upvotes: 1

user15398259

Reputation:

You can combine .drop() and .filter():

>>> df.filter(like='meta')
  metadata1 metadat2
0         e        f
1         k        l
>>> df.drop(columns=df.filter(like='meta'))
  Country 2010 2011 2012
0       a    b    c    d
1       g    h    i    j

You can also use regex= to select all columns whose names do not contain meta:

>>> df.filter(regex='^(?:(?!meta).)+$')
  Country 2010 2011 2012
0       a    b    c    d
1       g    h    i    j
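An equivalent of the filter-plus-drop route is to hand drop the matched columns' Index explicitly (a sketch using the question's empty frame, rather than the populated one shown in this answer):

```python
import pandas as pd

# The question's frame, with no rows.
df = pd.DataFrame(columns=['Country', 2010, 2011, 2012, 'metadata1', 'metadat2'])

# filter(like='meta') selects the metadata columns; passing their
# .columns Index to drop removes them from the original frame.
out = df.drop(columns=df.filter(like='meta').columns)
print(list(out.columns))  # ['Country', 2010, 2011, 2012]
```

Passing .columns rather than the filtered DataFrame itself makes it explicit that drop is receiving labels.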

Upvotes: 3
