HunterLiu

Reputation: 801

How to check if email exists in p tag using Beautiful Soup?

I'm using Beautiful Soup to check whether there is an email address in a paragraph tag within a div tag. I'm looping through a list of the divs with a for loop:

for div in list_of_divs:

where each div looks like this:

<div>
  <p>Hello</p>
  <p>[email protected]</p>
</div>

Within the for loop, I have:

email = div.find(name="p", string=re.compile("^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$"))

The name="p" is working fine, but I'm not sure what to put for the string. Any help or direction is appreciated.

Upvotes: 1

Views: 259

Answers (1)

Wiktor Stribiżew

Reputation: 627056

You may use

import re
from bs4 import BeautifulSoup

html = """<div>
  <p>Hello</p>
  <p>[email protected]</p>
</div>"""
soup = BeautifulSoup(html, "html5lib")
list_of_divs = soup.find_all('div')
for div in list_of_divs:
    emails = div.find_all("p", string=re.compile(r"^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$"))
    print([em.text for em in emails])

Output: ['[email protected]']

Note that ^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$ is quite restrictive. You might want to use a more generic pattern like ^\S+@\S+\.\S+$, which matches one or more non-whitespace characters, @, one or more non-whitespace characters, a literal ., and again one or more non-whitespace characters.
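To make the difference between the two patterns concrete, here is a small sketch that runs both against a few strings. The sample addresses and the names STRICT, LOOSE and classify are made up for illustration; they are not from the question.

```python
import re

# The two patterns discussed above.
STRICT = re.compile(r"^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$")
LOOSE = re.compile(r"^\S+@\S+\.\S+$")

# Hypothetical sample strings, chosen to show where the patterns disagree.
samples = [
    "user@example.com",       # plain address
    "first+tag@example.com",  # "+" is not allowed by the strict local part
    "user@example.museum",    # final label is longer than 4 characters
    "Hello",                  # not an address at all
]

def classify(text):
    """Return (strict_matches, loose_matches) for one string."""
    return (bool(STRICT.search(text)), bool(LOOSE.search(text)))

results = {s: classify(s) for s in samples}
for s, (strict_ok, loose_ok) in results.items():
    print(f"{s!r}: strict={strict_ok}, loose={loose_ok}")
```

The strict pattern rejects the second and third samples, while the loose one accepts anything shaped like something@something.something with no whitespace, so it will also let through strings that are not valid addresses.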

Notes on the code:

  • With div.find_all("p", string=re.compile(r"^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$")), you get all p tags inside the current div element (descendants, not only direct children) whose string matches the regex from start to end, since the pattern is anchored with ^ and $.
  • print([em.text for em in emails]) prints the text of each p node found, that is, only the p nodes that contain nothing but an email address.
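If all you need is a yes/no answer for each div, as in the question title, a sketch along these lines may be enough. It relies on find returning None when nothing matches. The helper name has_email, the sample address, and the use of the built-in "html.parser" instead of "html5lib" are assumptions made for this example.

```python
import re
from bs4 import BeautifulSoup

# The more generic pattern from above; swap in the stricter one if preferred.
EMAIL_RE = re.compile(r"^\S+@\S+\.\S+$")

def has_email(div):
    """Return True if some <p> inside div has a string matching EMAIL_RE."""
    # find() returns the first matching tag, or None when there is no match.
    return div.find("p", string=EMAIL_RE) is not None

# Hypothetical markup: the first div contains an address, the second does not.
html = """<div>
  <p>Hello</p>
  <p>user@example.com</p>
</div>
<div>
  <p>No address here</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
results = [has_email(div) for div in soup.find_all("div")]
print(results)
```

With the sample markup this reports a match for the first div only.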

Upvotes: 1
