Reputation: 149
I'm trying to use BeautifulSoup to strip the HTML tags from an email body returned by exchangelib. What I have so far is this:
from exchangelib import Credentials, Account
import urllib3
from bs4 import BeautifulSoup
credentials = Credentials('[email protected]', 'topSecret')
account = Account('[email protected]', credentials=credentials, autodiscover=True)
for item in account.inbox.all().order_by('-datetime_received')[:1]:
    soup = BeautifulSoup(item.unique_body, 'html.parser')
    print(soup)
As is, this uses exchangelib to grab the most recent email from my inbox via Exchange and prints its unique_body, which contains the body text of the email. Here is a sample of the output from print(soup):
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
My end goal is to have it print:
Hey John,
Here is a test email
From what I'm reading in the BeautifulSoup documentation, the scraping step belongs between my "soup =" line and the final print line.
My issue is that the scraping portion of BeautifulSoup seems to require a class and h1 tags, such as: name_box = soup.find('h1', attrs={'class': 'name'}), but my HTML has none of these.
As someone who is new to Python, how should I go about doing this?
Upvotes: 1
Views: 3366
Reputation: 33384
You can use find_all to get all the font tags and then iterate over them, printing the text of each one.
from bs4 import BeautifulSoup
html="""<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")
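for font in soup.find_all('font'):
    # Strip whitespace and skip the spacer line that holds only a blank
    text = font.get_text(strip=True)
    if text:
        print(text)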
Output:
Hey John,
Here is a test email
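If you would rather not depend on the mail client wrapping every line in a font tag, a minimal sketch using the soup's stripped_strings generator might look like the following. The helper name html_to_lines and the shortened sample markup are my own assumptions for illustration; in the question's loop you would presumably pass item.unique_body instead.

```python
from bs4 import BeautifulSoup


def html_to_lines(html):
    """Return the non-empty text fragments of an HTML string, in document order.

    html_to_lines is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that contain only whitespace.
    return list(soup.stripped_strings)


# Shortened stand-in for the email body shown in the question (assumption).
sample = (
    "<html><body><div>"
    '<div style="margin:0;"><font size="2"><span>Hey John,</span></font></div>'
    '<div style="margin:0;"><font size="2"><span> </span></font></div>'
    '<div style="margin:0;"><font size="2"><span>Here is a test email</span></font></div>'
    "</div></body></html>"
)

lines = html_to_lines(sample)
print("\n".join(lines))
```

This keeps the tag names out of the code entirely, at the cost of also returning text from any other element in the message, such as signatures or quoted replies.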
Upvotes: 4
Reputation: 84465
You need to print the content of the font tags. You can use the select method and pass it a type selector for the font element.
from bs4 import BeautifulSoup as bs
html = '''
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
'''
soup = bs(html, 'lxml')
textStuff = [item.text for item in soup.select('font') if item.text != ' ']
print(textStuff)
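One caveat with comparing against a single space: email clients often fill spacer lines with a non-breaking space or several blanks, which that comparison would let through. A hedged variant, assuming the spacer is an &nbsp; entity and using the built-in html.parser so lxml is not required, could strip each value and drop whatever ends up empty. The sample markup and the name text_lines are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Stand-in for the email body; the spacer line here is assumed to be &nbsp;.
html = (
    "<html><body>"
    '<div><font size="2"><span>Hey John,</span></font></div>'
    '<div><font size="2"><span>&nbsp;</span></font></div>'
    '<div><font size="2"><span>Here is a test email</span></font></div>'
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims whitespace (including non-breaking spaces)
# from each text node, so spacer lines collapse to an empty string.
stripped = (font.get_text(strip=True) for font in soup.select("font"))

# Keep only the entries that still contain text after stripping.
text_lines = [text for text in stripped if text]
print(text_lines)
```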
Upvotes: 2