Josh
Josh

Reputation: 12791

Pandas str.extract to retrieve a dictionary from a string field

Say we have a Series holding a string like the following one:

0                         FOO
1                         NaN
2                     FOO x 9
3                         NaN
4                         NaN
5                     FOO x 2
6                         NaN
7                         NaN
8          BAR x 2, BAZ  x 11
9                         NaN

I am hoping to convert this to a dataframe, so that we end up with:

    FOO   BAR   BAZ
0     1     0     0
1     0     0     0
2     9     0     0
3     0     0     0
4     0     0     0
5     2     0     0
6     0     0     0
7     0     0     0
8     0     2    11
9     0     0     0

We can assume we know the list of "keys" (FOO, BAR and BAZ a priori).

Is this doable with str.extract?

Upvotes: 1

Views: 279

Answers (1)

akuiper
akuiper

Reputation: 214927

Assuming what you are showing here is what you have, i.e. all the key value pairs follow the pattern. key + spaces + x + spaces + value, you can use (?P<key>\w+)(?:\s+x\s+(?P<value>\d+))? to capture it. To break this down:

  • (?P<key>\w+) captures a named group key which consists of word characters, i.e. [0-9A-Za-z_] adjust this if it is not the case;
  • (?:\s+x\s+(?P<value>\d+))? captures an optional group which will be missing if the value is 1;
  • \s+x\s+ captures the splitting strings which will be ignored in the result;
  • (?P<value>\d+) captures another named group value which consists of digits;

Combined together with extractall which makes multiple rows if there are multiple matches as in the case of row 8:

df1 = (df[1].str.extractall("(?P<key>\w+)(?:\s+x\s+(?P<value>\d+))?")
            .fillna(1).reset_index(level=1, drop=True))
df1

enter image description here

df1 gives a reduced data set after extracting all the key value pairs, to transform it to the format you needed, you can unstack the key column to make it column headers and reindex it with the index of the original data frame:

df1.set_index('key', append=True).value.unstack(level=1).reindex(df.index).fillna(0)

enter image description here

Upvotes: 3

Related Questions