Reputation: 10051

Rename columns with regex and Pandas to extract contents between specific punctuations

Given a test dataset as follows:

  city district  ... Q3:*your age[open question]  Q4:*skill[open question]
0   bj       cy  ...                          45                              R
1   bj       cy  ...                          34                         Python

I need to rename columns by using regex to extract the contents between * and [, also removing your if they exist. Please note in real case, I have many Questions columns as below.

df.columns
Out[112]: 
Index(['city', 'district', 'name', 'Q1:*your tel[open question]',
       'Q2:*your profession[close question]', 'Q3:*your age[open question]',
       'Q4:*skill[open question]'],
      dtype='object')

The expected columns will like this:

['city', 'district', 'name', 'tel', 'profession', 'age', 'skill']

How could I do that in Pandas and regex? Many thanks at advance.

Upvotes: 1

Answers (3)

wwnde

Reputation: 26676

df13 = df.filter(regex = '^[^Q]', axis = 1) #Isolate columns without Q
df12 = df.filter(regex = '^Q', axis = 1) #isolate columns with Q
x = df.filter(regex = '^Q', axis = 1).reindex(df.filter(regex = '^Q', axis = 1).\
columns.str.findall('([a-z]+(?=\[))').str.join(','),\
axis = "columns").columns # transform column names with q
df12.columns=list(x) # reset df12 column names
pd.concat([df13, df12], axis = 1)

Example Data

df=pd.DataFrame({'city':[1], 'district':[1], 'name':[1], 'Q1:*your tel[open question]':[1],
       'Q2:*your profession[close question]':[1], 'Q3:*your age[open question]':[1],
       'Q4:*skill[open question]':[1]})

print(df)

 city  district  name  Q1:*your tel[open question]  \
0     1         1     1                            1   

   Q2:*your profession[close question]  Q3:*your age[open question]  \
0                                    1                            1   

   Q4:*skill[open question]  
0                         1

Outcome

    city  district  name  tel  profession  age  skill
0     1         1     1    1           1    1      1

Upvotes: 1

Ralubrusto

Reputation: 1501

Whenever facing problems like this, you may wanna give yourself some time to think what pattern does match what you want. For this I'd recommend this site, where you can paste your target text and try some patterns out. It takes some time, but that is how we actually learn (regex are always a good exercise for the brain).

For your case, the pattern you desire is r'\*(your)?(.*)\[', where the second matching group is the word to be new name of the column. So you can try something like :

import re

pattern = r'\*(your)?(.*)\['

print('# Before\n', df.columns)
df = df.rename({
     col: re.search(pattern, col).group(2).strip() 
     if re.search(pattern, col) 
     else col 
     for col in df.columns
     }, axis=1)
print('# After\n', df.columns)

And the output will be:

# Before
 Index(['city', 'district', 'name', 'Q1:*your tel[open question]',
       'Q2:*your profession[close question]', 'Q3:*your age[open question]',
       'Q4:*skill[open question]'],
      dtype='object')
# After
 Index(['city', 'district', 'name', 'tel', 'profession', 'age', 'skill'], dtype='object')

Upvotes: 1

sharathnatraj

Reputation: 1614

Try:

cols = []
for i in df.columns:
    if re.search(r'Q\d:',i) != None:
        cols.append(re.match(r'^Q\d:\*(your\s)?([\w]*)',i).group(2))
    else:
        cols.append(i)

Oneliner substitute for above:

[(re.match(r'^Q\d:\*(your\s)?([\w]*)',i).group(2)) if re.search(r'Q\d:',i) is not None else i for i in df.columns]

Both prints:

['city', 'district', 'name', 'tel', 'profession', 'age', 'skill']

Upvotes: 1

Rename columns with regex and Pandas to extract contents between specific punctuations

Answers (3)

Related Questions