New Pandas Columns with Regex Parsing

Question

I am trying to parse text data in a pandas DataFrame based on tags and values found in another column's fields, and store each tag's value in its own column. For example, if I create this DataFrame, df:

import re
import pandas as pd

df = pd.DataFrame([[1,2],['A: this is a value B: this is the b val C: and here is c.','A: and heres another a. C: and another c']])
df = df.T
df.columns = ['col1','col2']


# Collect the tag names (the word before each colon) for every row
df['tags'] = df['col2'].apply(lambda x: re.findall(r'(?:\s|)(\w*)(?::)', x))
all_tags = []

for val in df['tags']:
    all_tags = all_tags + val
all_tags = list(set(all_tags))
for val in all_tags:
    df[val] = ''

df:
  col1                                               col2       tags A C B
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]      
1    2           A: and heres another a. C: and another c     [A, C]

How would I populate each of the new "tag" columns with their values from col2 so I get this df:

col1                                               col2           tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]   
1    2           A: and heres another a. C: and another c     [A, C]   

                  A               C                  B  
0       this is a value  and here is c.  this is the b val  
1  and heres another a.   and another c

akuiper · Accepted Answer

Another option using str.extractall with regex (?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$):

The regex captures the key (?P<key>\w+) before the colon and the value (?P<val>[^:]*) after the colon as two separate columns, key and val. The val group matches non-colon characters until it reaches the next key-value pair, which is enforced by the lookahead (?=\w+:|$). This assumes the key is always a single word; otherwise the split would be ambiguous:

import re
pat = re.compile(r"(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)")

pd.concat([
    df,
    (
        df.col2.str.extractall(pat)
          .reset_index('match', drop=True)
          .set_index('key', append=True)
          .val.unstack('key')
    )
], axis=1).fillna('')

Where str.extractall gives:

df.col2.str.extractall(pat)
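
To make the intermediate result easier to picture, here is a minimal sketch that rebuilds the question's sample data and inspects what str.extractall returns. The DataFrame construction and the variable name matches are illustrative assumptions, and the pattern is passed as a plain raw string rather than a compiled object.

```python
import pandas as pd

# Sample data mirroring col2 from the question (rebuilt here for illustration)
df = pd.DataFrame({
    'col1': [1, 2],
    'col2': ['A: this is a value B: this is the b val C: and here is c.',
             'A: and heres another a. C: and another c'],
})

# Same pattern as in the answer, written as a raw string
pat = r"(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)"

# One row per regex match; the index has two levels:
# the original row label and a 'match' counter within that row
matches = df['col2'].str.extractall(pat)
print(matches)
```

Each named group becomes a column (key and val), so the long-format result has one row per tag found. Note that the captured values keep the spaces that surround them in the original text.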

You then pivot the result so that each key becomes its own column, and concatenate it with the original DataFrame.
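
The pivot-and-concatenate step can be sketched end to end as follows. This is one possible variant rather than the only way to do it: the str.strip call is an addition not present in the answer (it removes the spaces captured around each value), and the names wide and result are hypothetical.

```python
import pandas as pd

# Sample data mirroring the question (rebuilt here for illustration)
df = pd.DataFrame({
    'col1': [1, 2],
    'col2': ['A: this is a value B: this is the b val C: and here is c.',
             'A: and heres another a. C: and another c'],
})

pat = r"(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)"

wide = (
    df['col2'].str.extractall(pat)
      # Assumption: trimming whitespace is wanted; the answer leaves it in
      .assign(val=lambda m: m['val'].str.strip())
      # Drop the per-row match counter, keeping the original row label
      .reset_index('match', drop=True)
      # Move 'key' into the index, then pivot it out into columns
      .set_index('key', append=True)['val']
      .unstack('key')
)

# Align on the row index; rows lacking a tag get an empty string
result = pd.concat([df, wide], axis=1).fillna('')
print(result)
```

If the empty tag columns from the question have already been added to df, drop them first; otherwise the concatenation would produce duplicate column names.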
