JK72

Reputation: 149

Returning body text using BeautifulSoup

I'm trying to use BeautifulSoup to strip the HTML tags from an email body returned by exchangelib. What I have so far is this:

from exchangelib import Credentials, Account
import urllib3
from bs4 import BeautifulSoup

credentials = Credentials('[email protected]', 'topSecret')
account = Account('[email protected]', credentials=credentials, autodiscover=True)

for item in account.inbox.all().order_by('-datetime_received')[:1]:
    soup = BeautifulSoup(item.unique_body, 'html.parser')
    print(soup)

As is, this uses exchangelib to grab the most recent email from my inbox via Exchange and prints its unique_body, which contains the body text of the email. Here is a sample of the output from print(soup):

<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>

My end goal is to have it print:

Hey John,
Here is a test email

From what I'm reading in the BeautifulSoup documentation, the extraction step goes between my "soup =" line and the final print line.

My issue is that the scraping examples for BeautifulSoup seem to require a tag and a class to search for, such as: name_box = soup.find('h1', attrs={'class': 'name'}). The HTML I have contains neither h1 tags nor classes.

As someone who is new to Python, how should I go about doing this?

Upvotes: 1

Views: 3366

Answers (2)

KunduK

Reputation: 33384

You can use find_all to get every font tag and then iterate over the results, printing the text of each one.

from bs4 import BeautifulSoup
html="""<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Each line of the email body is wrapped in its own <font> tag
for font in soup.find_all('font'):
    print(font.text)

Output:

Hey John,

Here is a test email
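
If the blank line in the middle is unwanted, one possible variation (a sketch, not tested against a real Exchange mailbox) is to strip each tag's text and skip the ones that come back empty. The names html and lines are only illustrative, and html stands in for item.unique_body from the question:

```python
from bs4 import BeautifulSoup

html = """<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

lines = []
for font in soup.find_all("font"):
    # strip=True trims surrounding whitespace from the tag's text
    text = font.get_text(strip=True)
    if text:  # skip the spacer tag that only holds whitespace
        lines.append(text)

print("\n".join(lines))
```

With the sample HTML this should print only the two lines of text asked for in the question.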

Upvotes: 4

QHarr

Reputation: 84465

You need to print the content of the font tags. You can use the select method and pass it a CSS type selector for the font element.

from bs4 import BeautifulSoup as bs

html = '''
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
'''

soup = bs(html, 'lxml')

# Keep only the font tags whose text is not just whitespace
text_lines = [item.text for item in soup.select('font') if item.text.strip()]
print(text_lines)
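
If the emails don't always wrap their text in font tags, a more general option might be BeautifulSoup's stripped_strings generator, which walks every text node in the document and skips the ones that are only whitespace. The following is a minimal sketch under that assumption. The helper name body_lines and the variable sample are made up for illustration, and html.parser is used here so that lxml doesn't have to be installed:

```python
from bs4 import BeautifulSoup


def body_lines(html):
    """Return the non-empty text fragments of an HTML string, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that contain nothing but whitespace
    return list(soup.stripped_strings)


sample = """<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

for line in body_lines(sample):
    print(line)
```

In the question's loop you would presumably call body_lines(item.unique_body) instead of using the sample string. Note that this splits on every text node, so a sentence containing inline markup such as bold text would come back as several fragments.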

Upvotes: 2
