Reputation: 758
I'm using
df[colname].str.extract(regex)
to parse a column of strings into several columns. I'd like to be able to assign the column names at the same time, something like:
df[colname].str.extract(regex, columns=cnames)
where:
cnames = ['col1','col2','col3']
regex = r'(sometext\w)_(aa|bb)_(\d+-\d)'
It's possible with a clunky construction like:
df[colname].str.extract(regex).rename(columns=dict(zip(range(len(cnames)), cnames)))
Or else I could embed the column names in the regex as named groups, so the regex changes to:
regex = r'(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)'
Am I missing something here? Is there a simpler way? Thanks
Upvotes: 5
Views: 747
Reputation: 4742
Embedding the names into the regex as named groups, as you have done, is the correct way to do this; it is what the documentation says to do.
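For example, a minimal sketch with a made-up frame, just to show that the group names become the column names:

import pandas as pd

df = pd.DataFrame({'colname': ['sometextA_aa_1-2', 'sometextB_bb_3-4']})
regex = r'(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)'
extracted = df['colname'].str.extract(regex)
# extracted has columns 'col1', 'col2', 'col3', taken from the group names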
Your first solution using .rename() would not be robust if you already had columns named 0, 1 and 2.
IMO the regex solution is the best, but you could use something like .pipe() to wrap this in a function. However, as you will see, it starts to get messy when you do not always want the same regex.
def extract_colnames(df, column, sep, cnames, drop_col=True):
    # optionally drop the original string column after extraction
    if drop_col:
        drop_col = [column]
    else:
        drop_col = []
    # build a named-group regex from the column names, e.g. for
    # cnames=['col1', 'col2', 'col3'] and sep='_' this gives
    # (?P<col1>.*)_(?P<col2>.*)_(?P<col3>.*)
    regex = '(?P<' + ('>.*)' + sep + '(?P<').join(cnames) + '>.*)'
    return df.join(df.loc[:, column].str.extract(regex, expand=True)).drop(drop_col, axis=1)
cnames = ['col1', 'col2', 'col3']
data = data.pipe(extract_colnames, column='colname',
                 sep='_', cnames=cnames, drop_col=True)
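On a small frame with made-up values, this appends the extracted columns and drops the original one (a sketch, assuming the separator only appears between the fields):

import pandas as pd

data = pd.DataFrame({'colname': ['x_aa_1', 'y_bb_2'], 'other': [10, 20]})
data = data.pipe(extract_colnames, column='colname',
                 sep='_', cnames=cnames, drop_col=True)
# data now has the columns 'other', 'col1', 'col2', 'col3'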
Upvotes: 1