Reputation: 469
i am trying to extract some data from a dataframe, however following query only extract the first match and ignores the rest of the matches, for example if the entire data is:
df['value']=
0 123 blah blah blah, 456 blah blah blah, 129kfj blah blah
1 237 blah blah blah, 438 blah blah blah, 365kfj blah blah
...
and the regex is:
df['newCol']=df['value'].str.extract("[0-9]{3}")
i want the result to be a new column name "newCol" as:
newCol
------
123,456,129
237,438,365
...
but the actual result i get is only the first number:
newCol
------
123
237
what is wrong here? :(
thank you
UPDATE:
thanks to MaxU I found the solution, just couple of suggestions. I had Pandas 0.18.1 so extractall didn't work for me untill i updated pandas to 0.19, so remember to check your pandas version if you have issue with Extractall...second, apply(','.join) didn't work for me because I had some non string values (Null values) and it couldn't handle it so I used Lambda and it finally worked with a small modification of MaxU solution.
x['value'].str.extractall(r'(\d{3})').unstack().apply(lambda x:','.join(x.dropna()), axis=1)
Upvotes: 8
Views: 11392
Reputation: 210842
you can use Series.str.extractall() method:
In [57]: x
Out[57]:
value
0 123 blah blah blah 456 blah blah blah 129kfj blah blah
1 237 blah blah blah 438 blah blah blah 365kfj blah blah
In [58]: x['newCol'] = x['value'].str.extractall(r'(\d{3})').unstack().apply(','.join, 1)
In [59]: x
Out[59]:
value newCol
0 123 blah blah blah 456 blah blah blah 129kfj blah blah 123,456,129
1 237 blah blah blah 438 blah blah blah 365kfj blah blah 237,438,365
UPDATE:
In [77]: x
Out[77]:
value
0 123 blah blah blah, 456 blah blah blah, 129kfj blah blah
1 237 blah blah blah, 438 blah blah blah, 365kfj blah blah
In [78]: x['value'].str.extractall(r'(\d{3})').unstack().apply(','.join, 1)
Out[78]:
0 123,456,129
1 237,438,365
dtype: object
Upvotes: 6