Rob Buckley

Reputation: 758

assign column names to the output of Series.str.extract()

I'm using

df[colname].str.extract(regex) 

to parse a column of strings into several columns. I'd like to be able to assign the column names at the same time, something like:

df[colname].str.extract(regex, columns=cnames) 

where:

cnames = ['col1','col2','col3']
regex = r'(sometext\w)_(aa|bb)_(\d+-\d)'

It's possible with a clunky construction like:

df[colname].str.extract(regex).rename(columns = dict(zip(range(len(cnames)),cnames)))

Or else I could embed the column names in the regex as named groups, so the regex changes to:

regex = r'(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)'

Am I missing something here? Is there a simpler way? Thanks.

Upvotes: 5

Views: 747

Answers (1)

Little Bobby Tables

Reputation: 4742

Embedding the names in the regex as named groups is a correct way of doing this. The documentation for Series.str.extract() says that named groups become the column names in the result.
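As a minimal sketch of the named-group behaviour, using the regex from the question (the sample strings are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data shaped like the pattern in the question
s = pd.Series(["sometextA_aa_12-3", "sometextB_bb_45-6", "no match"])

regex = r"(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)"

# Each named group becomes a column; rows that do not match are filled with NaN
out = s.str.extract(regex, expand=True)

print(list(out.columns))
```

The column names come straight from the group names, so no separate renaming step is needed.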

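Your first solution using .rename() works, but it relies on the default integer labels 0, 1 and 2, so it would not be robust if the extracted columns ended up alongside existing columns that already had those names.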
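If you would rather keep the names in a separate list, one possible alternative to the dict(zip(...)) mapping is DataFrame.set_axis(), which relabels the columns by position. This is only a sketch: it assumes a reasonably recent pandas in which set_axis() returns a new object, and the sample strings are made up.

```python
import pandas as pd

# Hypothetical sample data shaped like the pattern in the question
s = pd.Series(["sometextA_aa_12-3", "sometextB_bb_45-6"])

cnames = ["col1", "col2", "col3"]
regex = r"(sometext\w)_(aa|bb)_(\d+-\d)"

# Unnamed groups give columns 0, 1, 2; set_axis replaces them positionally
out = s.str.extract(regex, expand=True).set_axis(cnames, axis=1)

print(list(out.columns))
```

The list must have exactly one name per capture group, otherwise pandas raises a length-mismatch error.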

IMO the named-group regex is the best solution, but you could also wrap the extraction in a function and call it with .pipe(), as below. As you will see, it starts to get messy once each column needs a different pattern rather than the same one.

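import re

def extract_colnames(df, column, sep, cnames, drop_col=True):
    # One named group per output column, joined by the (escaped) separator,
    # e.g. (?P<col1>.*)_(?P<col2>.*)_(?P<col3>.*)
    regex = re.escape(sep).join('(?P<{}>.*)'.format(name) for name in cnames)
    # Named groups become the column names of the extracted frame
    out = df.join(df[column].str.extract(regex, expand=True))
    # Optionally drop the source column
    return out.drop(columns=[column]) if drop_col else out

# data is assumed to be an existing DataFrame with a string column 'colname'
cnames = ['col1','col2','col3']
data = data.pipe(extract_colnames, column='colname',
                 sep='_', cnames=cnames, drop_col=True)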
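To illustrate how this grows when the columns need different patterns, here is a rough sketch of a variant that takes a mapping from column name to regex fragment. The helper name extract_named, its parameters and the sample data are all hypothetical; it relies on dicts preserving insertion order (Python 3.7+).

```python
import re
import pandas as pd

def extract_named(df, column, patterns, sep="_"):
    """Hypothetical helper: `patterns` maps output column name -> regex fragment."""
    # Wrap each fragment in a named group and join with the escaped separator
    regex = re.escape(sep).join(
        "(?P<{}>{})".format(name, pat) for name, pat in patterns.items()
    )
    return df.join(df[column].str.extract(regex, expand=True))

# Made-up data shaped like the pattern in the question
data = pd.DataFrame({"colname": ["sometextA_aa_12-3", "sometextB_bb_45-6"]})
patterns = {"col1": r"sometext\w", "col2": r"aa|bb", "col3": r"\d+-\d"}

result = data.pipe(extract_named, column="colname", patterns=patterns)
print(list(result.columns))
```

At this point the helper is mostly rebuilding the named-group regex from the question, which is why writing that regex directly is usually the simpler choice.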

Upvotes: 1
