Reputation: 10051
Given a test dataset as follows:
city district ... Q3:*your age[open question] Q4:*skill[open question]
0 bj cy ... 45 R
1 bj cy ... 34 Python
I need to rename columns by using regex to extract the contents between *
and [
, also removing your
if they exist. Please note in real case, I have many Questions columns as below.
df.columns
Out[112]:
Index(['city', 'district', 'name', 'Q1:*your tel[open question]',
'Q2:*your profession[close question]', 'Q3:*your age[open question]',
'Q4:*skill[open question]'],
dtype='object')
The expected columns will like this:
['city', 'district', 'name', 'tel', 'profession', 'age', 'skill']
How could I do that in Pandas and regex? Many thanks at advance.
Upvotes: 1
Views: 280
Reputation: 26676
df13 = df.filter(regex = '^[^Q]', axis = 1) #Isolate columns without Q
df12 = df.filter(regex = '^Q', axis = 1) #isolate columns with Q
x = df.filter(regex = '^Q', axis = 1).reindex(df.filter(regex = '^Q', axis = 1).\
columns.str.findall('([a-z]+(?=\[))').str.join(','),\
axis = "columns").columns # transform column names with q
df12.columns=list(x) # reset df12 column names
pd.concat([df13, df12], axis = 1)
Example Data
df=pd.DataFrame({'city':[1], 'district':[1], 'name':[1], 'Q1:*your tel[open question]':[1],
'Q2:*your profession[close question]':[1], 'Q3:*your age[open question]':[1],
'Q4:*skill[open question]':[1]})
print(df)
city district name Q1:*your tel[open question] \
0 1 1 1 1
Q2:*your profession[close question] Q3:*your age[open question] \
0 1 1
Q4:*skill[open question]
0 1
Outcome
city district name tel profession age skill
0 1 1 1 1 1 1 1
Upvotes: 1
Reputation: 1501
Whenever facing problems like this, you may wanna give yourself some time to think what pattern does match what you want. For this I'd recommend this site, where you can paste your target text and try some patterns out. It takes some time, but that is how we actually learn (regex are always a good exercise for the brain).
For your case, the pattern you desire is r'\*(your)?(.*)\['
, where the second matching group is the word to be new name of the column. So you can try something like :
import re
pattern = r'\*(your)?(.*)\['
print('# Before\n', df.columns)
df = df.rename({
col: re.search(pattern, col).group(2).strip()
if re.search(pattern, col)
else col
for col in df.columns
}, axis=1)
print('# After\n', df.columns)
And the output will be:
# Before
Index(['city', 'district', 'name', 'Q1:*your tel[open question]',
'Q2:*your profession[close question]', 'Q3:*your age[open question]',
'Q4:*skill[open question]'],
dtype='object')
# After
Index(['city', 'district', 'name', 'tel', 'profession', 'age', 'skill'], dtype='object')
Upvotes: 1
Reputation: 1614
Try:
cols = []
for i in df.columns:
if re.search(r'Q\d:',i) != None:
cols.append(re.match(r'^Q\d:\*(your\s)?([\w]*)',i).group(2))
else:
cols.append(i)
Oneliner substitute for above:
[(re.match(r'^Q\d:\*(your\s)?([\w]*)',i).group(2)) if re.search(r'Q\d:',i) is not None else i for i in df.columns]
Both prints:
['city', 'district', 'name', 'tel', 'profession', 'age', 'skill']
Upvotes: 1