Reputation: 87
I scraped some HTML text that needs to be formatted into a table. I would like to extract everything inside the bold tags: <b></b>
I have the following code:
import pandas as pd
html='<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'
Thus far, I've tried to put it into a list and then into a dataframe:
html_list=html.split('br')
html_df=pd.DataFrame(html_list,columns=['content'])
This yields:
print(html_df)
                     content
0    <b>HR</b>Shohei Ohtani<
1  ><b>2B</b>Mike Trout(2)/<
2        ><b>SF</b>Billy Bob
I want this:
print(html_df)
                     content var
0    <b>HR</b>Shohei Ohtani<  HR
1  ><b>2B</b>Mike Trout(2)/<  2B
2        ><b>SF</b>Billy Bob  SF
I tried using Beautiful Soup and .findall, to no avail. I'm open to different approaches, including redoing some of my earlier steps.
Upvotes: 0
Views: 62
Reputation: 175
Just use one line of code as follows:
html_df['var'] = html_df['content'].str.extract(r'<b>(.*?)</b>', expand=False)
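For completeness, here is a minimal end-to-end sketch of the regex approach using the sample string from the question. It assumes each row contains exactly one <b>...</b> pair, and it splits on the full '<br>' tag rather than 'br', which is my own change to avoid the stray '<' and '>' characters seen in the question's output.

```python
import pandas as pd

html = '<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'

# Split on the whole '<br>' tag so no stray '<' or '>' is left in the rows
html_df = pd.DataFrame(html.split('<br>'), columns=['content'])

# Capture the text between <b> and </b>; expand=False returns a Series
html_df['var'] = html_df['content'].str.extract(r'<b>(.*?)</b>', expand=False)

print(html_df)
```

If a row could hold several bold tags, str.extract only returns the first match, so str.findall or str.extractall would probably be a better fit there.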
Upvotes: 1
Reputation: 719
Is this what you need?
from bs4 import BeautifulSoup
html='<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
b_tags = soup.find_all('b')
for b_tag in b_tags:
    print(b_tag.text)  # text inside each <b> tag
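If you want a DataFrame rather than printed values, something along these lines might work. It is only a sketch: the column name 'text' and the use of next_sibling to pick up the string following each <b> tag are my own assumptions about the layout, based on the sample string in the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = '<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'
soup = BeautifulSoup(html, 'html.parser')

rows = []
for b_tag in soup.find_all('b'):
    # The node right after </b> is assumed to be the plain text for that entry
    following = b_tag.next_sibling
    # NavigableString subclasses str; anything else (a tag or None) becomes ''
    text = str(following).strip() if isinstance(following, str) else ''
    rows.append({'var': b_tag.get_text(), 'text': text})

df = pd.DataFrame(rows)
print(df)
```

This keeps the label and the text in separate columns and avoids splitting the raw HTML by hand.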
Upvotes: 1