Reputation: 2337
I am trying to write a regex that matches columns in my dataframe. All the columns in the dataframe are
cols = ['after_1', 'after_2', 'after_3', 'after_4', 'after_5', 'after_6',
'after_7', 'after_8', 'after_9', 'after_10', 'after_11', 'after_12',
'after_13', 'after_14', 'after_15', 'after_16', 'after_17', 'after_18',
'after_19', 'after_20', 'after_21', 'after_22', 'after_10_missing',
'after_11_missing', 'after_12_missing', 'after_13_missing',
'after_14_missing', 'after_15_missing', 'after_16_missing',
'after_17_missing', 'after_18_missing', 'after_19_missing',
'after_1_missing', 'after_20_missing', 'after_21_missing',
'after_22_missing', 'after_2_missing', 'after_3_missing',
'after_4_missing', 'after_5_missing', 'after_6_missing',
'after_7_missing', 'after_8_missing', 'after_9_missing']
I want to select all the columns whose names contain a number from 1 to 14.
This code works
df.filter(regex = '^after_[1-9]$|after_([1-9]\D|1[0-4])').columns
but I'm wondering how to write it as a single pattern instead of splitting it in two. The first part selects all strings that end in a number between 1 and 9 (i.e. 'after_1' ... 'after_9') but not their "missing" counterparts. The second part (after the |) selects any string that contains 'after_' followed by either a digit from 1 to 9 and then a non-digit character (\D), or a 1 followed by a digit from 0 to 4.
Is there a better way to write this?
I already tried
df.filter(regex = 'after_([1-9]|1[0-4])').columns
But that also picks up columns whose number merely begins with a 1 or a 2 (e.g. 'after_20'), because [1-9] is satisfied by the first digit alone.
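To see why the shorter pattern over-matches, here is a small check using only Python's re module. It assumes that df.filter(regex=...) keeps a label whenever re.search finds the pattern anywhere in it, and it uses a hand-picked subset of the column names above rather than the real dataframe.

```python
import re

# A hand-picked subset of the column names from the question.
cols = ['after_9', 'after_14', 'after_15', 'after_20', 'after_20_missing']

# The unanchored attempt from the question.
loose = re.compile(r'after_([1-9]|1[0-4])')

# re.search only needs the pattern to occur somewhere in the string,
# so the 'after_2' inside 'after_20' is enough to select that column.
matched = [c for c in cols if loose.search(c)]
print(matched)

# The text that was actually matched inside 'after_20'.
print(loose.search('after_20').group(0))
```

Nothing in the pattern says what may or may not come after the number, so every column in the subset is selected.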
Upvotes: 0
Views: 78
Reputation: 1321
Try this: after_([1-9]|1[0-4])[a-zA-Z_]*\b
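import re

# \b requires a word boundary after the match, so 'after_1' cannot match inside 'after_15'
regexp = r'(after_)([1-9]|1[0-4])(_missing)*\b'
cols = ['after_1', 'after_14', 'after_15', 'after_14_missing', 'after_15_missing', 'after_9_missing']
for i in cols:
    # an empty list means the column name did not match
    print(i, re.findall(regexp, i))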
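As a rough check against the full column list, the sketch below applies the first suggested pattern with re.search, which is assumed here to be how df.filter(regex=...) tests each label. The list all_cols is rebuilt programmatically from the names in the question (in a different order), and the variable names are only illustrative. Note the r prefix: in a normal Python string, '\b' is a backspace character rather than a word boundary.

```python
import re

# Rebuild the 44 column names from the question: after_1..after_22
# plus their '_missing' counterparts (order differs from the original list).
all_cols = [f'after_{i}' for i in range(1, 23)]
all_cols += [f'after_{i}_missing' for i in range(1, 23)]

# Pattern from the answer, written as a raw string so \b stays a word boundary.
pattern = re.compile(r'after_([1-9]|1[0-4])[a-zA-Z_]*\b')

# Keep only the names in which the pattern is found.
selected = [c for c in all_cols if pattern.search(c)]
print(len(selected))
print(selected)
```

If the assumption about re.search holds, passing the same raw string to df.filter(regex=...) should select the same columns.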
Upvotes: 1