Assign column names to the output of Series.str.extract()

Question

I'm using

df[colname].str.extract(regex)

to parse a column of strings into several columns. I'd like to be able to assign the column names at the same time, something like:

df[colname].str.extract(regex, columns=cnames)

where:

cnames = ['col1','col2','col3']
regex = r'(sometext\w)_(aa|bb)_(\d+-\d)'

It's possible with a clunky construction like:

df[colname].str.extract(regex).rename(columns = dict(zip(range(len(cnames)),cnames)))

Or else I could embed the column names in the regex as named groups, so the regex changes to:

regex = r'(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)'

Am I missing something here? Is there a simpler way? Thanks.

Little Bobby Tables · Accepted Answer

Embedding the names in the regex as named groups is a correct way of doing this. The documentation for str.extract() says that named groups become the column names in the result.
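As a minimal sketch of the named-group approach (the sample strings below are made up to fit the question's regex, and the group names are taken from cnames):

```python
import pandas as pd

# Hypothetical sample data shaped to match the question's regex.
df = pd.DataFrame(
    {"colname": ["sometextA_aa_12-3", "sometextB_bb_456-7", "no match"]}
)

# Each named group (?P<name>...) becomes a column called `name`.
regex = r"(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)"

extracted = df["colname"].str.extract(regex, expand=True)
print(extracted)
```

Rows that do not match the pattern come back as NaN in every extracted column.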

Your first solution using .rename() would not be robust if you had some columns with the names 0, 1 and 2 already.
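If you would rather keep the names out of the regex, one alternative to the dict-based rename is to set the labels by position. This is only a sketch with made-up sample data; it assumes cnames has exactly one entry per capture group:

```python
import pandas as pd

# Hypothetical sample data; the regex and cnames come from the question.
df = pd.DataFrame({"colname": ["sometextA_aa_12-3", "sometextB_bb_456-7"]})
cnames = ["col1", "col2", "col3"]
regex = r"(sometext\w)_(aa|bb)_(\d+-\d)"

extracted = df["colname"].str.extract(regex, expand=True)
# Positional assignment: raises if len(cnames) != number of groups.
extracted.columns = cnames

result = df.join(extracted)
print(result)
```

Because the labels are assigned by position, nothing depends on what the default columns 0, 1 and 2 happen to be called.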

IMO the named-group regex is the best solution, but you could also use something like .pipe() to wrap the extraction in a function. However, as you will see, it starts to get messy when the groups should not all share the same pattern.

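def extract_colnames(df, column, sep, cnames, drop_col=True):
    # Build one greedy named group per name, joined by sep, e.g.
    # '(?P<col1>.*)_(?P<col2>.*)_(?P<col3>.*)'. Note that sep is used as regex text.
    regex = '(?P<' + ('>.*)' + sep + '(?P<').join(cnames) + '>.*)'
    extracted = df[column].str.extract(regex, expand=True)
    # Optionally drop the source column once the new columns are joined.
    to_drop = [column] if drop_col else []
    return df.join(extracted).drop(to_drop, axis=1)

cnames = ['col1','col2','col3']
# data is an existing DataFrame with a string column named 'colname'
data = data.pipe(extract_colnames, column='colname',
                 sep='_', cnames=cnames, drop_col=True)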
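To see what the helper produces, here is a self-contained run on made-up data. The function is restated in a slightly more compact form so the snippet runs on its own, and the sample strings are hypothetical:

```python
import pandas as pd

def extract_colnames(df, column, sep, cnames, drop_col=True):
    # Same idea as the function above: one greedy named group per name.
    regex = sep.join("(?P<{}>.*)".format(name) for name in cnames)
    out = df.join(df[column].str.extract(regex, expand=True))
    return out.drop(columns=[column]) if drop_col else out

# Hypothetical sample data.
data = pd.DataFrame({"colname": ["sometextA_aa_12-3", "sometextB_bb_456-7"]})

result = data.pipe(
    extract_colnames,
    column="colname",
    sep="_",
    cnames=["col1", "col2", "col3"],
)
print(result)
```

The generated pattern here is (?P<col1>.*)_(?P<col2>.*)_(?P<col3>.*), which is looser than the regex in the question: every group accepts any text, so the per-group constraints such as aa|bb are lost. That is where this approach gets messy.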
