Reputation: 12791
Say we have a Series holding a string like the following one:
0 FOO
1 NaN
2 FOO x 9
3 NaN
4 NaN
5 FOO x 2
6 NaN
7 NaN
8 BAR x 2, BAZ x 11
9 NaN
I am hoping to convert this to a dataframe, so that we end up with:
FOO BAR BAZ
0 1 0 0
1 0 0 0
2 9 0 0
3 0 0 0
4 0 0 0
5 2 0 0
6 0 0 0
7 0 0 0
8 0 2 11
9 0 0 0
We can assume we know the list of "keys" (FOO
, BAR
and BAZ
a priori).
Is this doable with str.extract
?
Upvotes: 1
Views: 279
Reputation: 214927
Assuming what you are showing here is what you have, i.e. all the key value pairs follow the pattern. key + spaces + x + spaces + value, you can use (?P<key>\w+)(?:\s+x\s+(?P<value>\d+))?
to capture it. To break this down:
(?P<key>\w+)
captures a named group key
which consists of word characters, i.e. [0-9A-Za-z_]
adjust this if it is not the case;(?:\s+x\s+(?P<value>\d+))?
captures an optional group which will be missing if the value is 1;\s+x\s+
captures the splitting strings which will be ignored in the result;(?P<value>\d+)
captures another named group value
which consists of digits;Combined together with extractall
which makes multiple rows if there are multiple matches as in the case of row 8:
df1 = (df[1].str.extractall("(?P<key>\w+)(?:\s+x\s+(?P<value>\d+))?")
.fillna(1).reset_index(level=1, drop=True))
df1
df1
gives a reduced data set after extracting all the key value pairs, to transform it to the format you needed, you can unstack the key column to make it column headers and reindex it with the index of the original data frame:
df1.set_index('key', append=True).value.unstack(level=1).reindex(df.index).fillna(0)
Upvotes: 3