Reputation: 29
I am trying to extract every match that sits between ">" and "<" in a string.
The code below returns only the first match in the string.
In:
import pandas as pd
import re
df = pd.Series(['<option value="85">APOE</option><option value="636">PICALM1<'])
reg = r'(>([A-Z])\w+<)'
df2 = df.str.extract(reg)
print(df2)
Out:
        0  1
0  >APOE<  A
I would like to return both "APOE" and "PICALM1", not just "APOE".
Thanks for your help!
Upvotes: 2
Views: 2899
Reputation: 408
import re
import pandas as pd
df['new_col'] = df['old_col'].str.findall(r'>([A-Z][^<]+)<')
This stores all matches for each row as a list in the new_col column of the DataFrame.
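To make this concrete, here is a minimal sketch that applies the same pattern to the Series from the question. The variable names (s, pattern, as_lists, as_rows) are illustrative and not from the original posts. It also shows str.extractall, which is an alternative I am adding for comparison: it returns one row per match instead of one list per row.

```python
import pandas as pd

# Sample data copied from the question
s = pd.Series(['<option value="85">APOE</option><option value="636">PICALM1<'])

# One capture group: an uppercase letter followed by anything up to the next "<"
pattern = r'>([A-Z][^<]+)<'

# str.findall returns, for each row, a list of the captured groups
as_lists = s.str.findall(pattern)

# str.extractall returns a DataFrame with one row per match,
# indexed by (original row label, match number)
as_rows = s.str.extractall(pattern)

print(as_lists[0])           # ['APOE', 'PICALM1']
print(as_rows[0].tolist())   # ['APOE', 'PICALM1']
```

Lists are convenient when you want to keep one row per input string. The extractall form is probably easier to work with if each match should become its own row.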
Upvotes: 2
Reputation: 5560
No need for pandas.
import re
df = '<option value="85">APOE</option><option value="636">PICALM1<'
reg = r'>([A-Z][^<]+)<'
print(re.findall(reg, df))
['APOE', 'PICALM1']
Parsing HTML with regular expressions may not be the best idea. Have you considered using BeautifulSoup?
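As a rough illustration of the parser-based approach, here is a sketch that uses Python's standard-library html.parser instead of BeautifulSoup, only to avoid a third-party dependency. The class name OptionTextParser is hypothetical. I have also assumed well-formed input by completing the final closing option tag, which the string in the question lacks.

```python
from html.parser import HTMLParser


class OptionTextParser(HTMLParser):
    """Collects the text found inside <option> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_option = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.in_option = True

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside an <option> element
        if self.in_option and data.strip():
            self.texts.append(data.strip())


# Assumption: the markup is complete, unlike the truncated sample in the question
html = '<option value="85">APOE</option><option value="636">PICALM1</option>'

parser = OptionTextParser()
parser.feed(html)
parser.close()
print(parser.texts)  # ['APOE', 'PICALM1']
```

Unlike the regular expression, this does not depend on the text starting with an uppercase letter, and it ignores ">" or "<" characters that appear outside option elements.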
Upvotes: 0