Extracting words within HTML tags in Python/Pandas

I have scraped HTML text that needs to be formatted into a table. I would like to extract everything inside the bold tags, <b></b>. I have the following code:

import pandas as pd
html='<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'

Thus far, I've tried to put it into a list and then into a dataframe:

html_list=html.split('br')
html_df=pd.DataFrame(html_list,columns=['content'])

This yields:

print(html_df)
                     content
0    <b>HR</b>Shohei Ohtani<
1  ><b>2B</b>Mike Trout(2)/<
2        ><b>SF</b>Billy Bob

I want this:

print(html_df)
                     content var
0    <b>HR</b>Shohei Ohtani< HR
1  ><b>2B</b>Mike Trout(2)/< 2B
2        ><b>SF</b>Billy Bob SF

I tried using Beautiful Soup and .findall, to no avail. I'm open to different approaches, including reversing some of my steps.

Upvotes: 0

Answers (2)

Tao-Lung Huang

Reputation: 175

Solution

Just use one line of code as follows:

html_df['var'] = html_df['content'].str.extract(r'<b>(.*?)</b>', expand=False)
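
For context, here is a minimal end-to-end sketch of the same idea, using the sample string from the question. Splitting on the full '<br>' tag (rather than 'br') and the extra 'name' column are my assumptions about what a clean table might look like; they are not part of the original question, whose desired output keeps the stray '<' and '>' characters.

```python
import pandas as pd

html = '<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'

# Assumption: split on the whole '<br>' tag so no stray '<' or '>' remains.
html_df = pd.DataFrame(html.split('<br>'), columns=['content'])

# Capture the text between <b> and </b>; expand=False returns a Series.
html_df['var'] = html_df['content'].str.extract(r'<b>(.*?)</b>', expand=False)

# Hypothetical extra column: whatever follows the leading <b>...</b> pair.
html_df['name'] = html_df['content'].str.replace(r'^<b>.*?</b>', '', regex=True)

print(html_df)
```

The lazy group (.*?) stops at the first </b>, which matters if a row ever contains more than one bold tag.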

Upvotes: 1

Shreyesh Desai

Reputation: 719

Is this what you need?:

from bs4 import BeautifulSoup
html='<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'
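soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
b_tags = soup.find_all('b')

# Print the text inside each <b> tag: HR, 2B, SF
for b_tag in b_tags:
    print(b_tag.text)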
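
If the goal is a table rather than printed output, the loop could also collect each tag together with the text that follows it. This is only a sketch: the 'var' and 'name' keys are hypothetical labels, and it assumes each <b> tag is immediately followed by plain text, as in the sample string.

```python
from bs4 import BeautifulSoup

html = '<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'
soup = BeautifulSoup(html, 'html.parser')

rows = []
for b_tag in soup.find_all('b'):
    following = b_tag.next_sibling
    # Text nodes are str subclasses; anything else (a tag or None) is skipped.
    name = str(following) if isinstance(following, str) else ''
    rows.append({'var': b_tag.get_text(), 'name': name})

print(rows)
```

Passing rows to pd.DataFrame(rows) should then give one row per bold tag.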

Upvotes: 1
