alacoste

Reputation: 29

Return multiple matches of a regular expression within a string in Python pandas

I am trying to extract all matches found between ">" and "<" in a string.

The code below only returns the first match in the string.

In:    
import pandas as pd
import re
df = pd.Series(['<option value="85">APOE</option><option value="636">PICALM1<'])
reg = r'(>([A-Z])\w+<)'  # raw string so \w is not treated as a string escape
df2 = df.str.extract(reg)
print(df2)

Out:
    0   1
0   >APOE<  A

I would like to return both "APOE" and "PICALM1", not just "APOE".

Thanks for your help!

Upvotes: 2

Views: 2899

Answers (2)

user2335580

Reputation: 408

import pandas as pd

# df is a DataFrame whose 'old_col' column holds the HTML strings
df['new_col'] = df['old_col'].str.findall(r'>([A-Z][^<]+)<')

This stores all matches for each row as a list in the new_col column of the DataFrame.
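Applied to the Series from the question, this might look like the sketch below. The variable names (s, pattern, as_lists, as_rows) are only illustrative. It also shows str.extractall, which is a different pandas method from the one in this answer: it returns one row per match instead of one list per row, which may be more convenient depending on what you do next.

```python
import pandas as pd

# Same sample data as in the question
s = pd.Series(['<option value="85">APOE</option><option value="636">PICALM1<'])

# One capture group: the text between ">" and "<", starting with a capital letter
pattern = r'>([A-Z][^<]+)<'

# findall: one list of captured strings per row of the Series
as_lists = s.str.findall(pattern)

# extractall: one row per match, indexed by (original row, match number)
as_rows = s.str.extractall(pattern)

print(as_lists[0])
print(as_rows[0].tolist())
```

With findall the result keeps the original index, so it can be assigned straight back as a new column. With extractall you get a MultiIndex and would need to reshape it yourself if you want it aligned with the original rows.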

Upvotes: 2

Josep Valls

Reputation: 5560

No need for pandas.

import re

df = '<option value="85">APOE</option><option value="636">PICALM1<'
reg = r'>([A-Z][^<]+)<'
# findall returns every non-overlapping match of the capture group
print(re.findall(reg, df))
['APOE', 'PICALM1']

Parsing HTML with regular expressions may not be the best idea. Have you considered using BeautifulSoup?
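As a rough sketch of what a parser-based approach could look like, here is a version using the standard library's html.parser instead of BeautifulSoup (swapped in only to avoid a third-party dependency; the idea is the same). The class name OptionTextCollector is made up for this example, and the input is assumed to be well-formed, so the last tag is closed with </option> rather than truncated as in the question.

```python
from html.parser import HTMLParser


class OptionTextCollector(HTMLParser):
    """Collects the text content of every <option> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_option = False  # True while we are inside an <option> tag
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.in_option = True

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside an <option>
        if self.in_option and data.strip():
            self.names.append(data.strip())


# Assumed well-formed input: the question's string with the last tag closed
html = '<option value="85">APOE</option><option value="636">PICALM1</option>'

collector = OptionTextCollector()
collector.feed(html)
collector.close()
print(collector.names)
```

Unlike the regex, this does not depend on the text starting with a capital letter, and it ignores text outside option elements.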

Upvotes: 0
