Reputation: 801
I'm using Beautiful Soup to check whether there is an email address in a paragraph tag within a div tag. I'm looping through a list of the divs with a for loop:
for div in list_of_divs:
where each div looks like this:
<div>
<p>Hello</p>
<p>[email protected]</p>
</div>
Within the for loop, I have:
email = div.find(name="p", string=re.compile("^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$"))
The name="p" is working fine, but I'm not sure what to put for the string. Any help or direction is appreciated.
Upvotes: 1
Views: 259
Reputation: 627056
You may use:
import re
from bs4 import BeautifulSoup

html = """<div>
<p>Hello</p>
<p>[email protected]</p>
</div>"""
soup = BeautifulSoup(html, "html5lib")  # requires the html5lib package
list_of_divs = soup.find_all('div')
for div in list_of_divs:
    # Keep only <p> tags whose whole text matches the email pattern
    emails = div.find_all("p", string=re.compile(r"^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$"))
    print([em.text for em in emails])
Output: ['[email protected]']
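One reason the pattern from the question does not work as written is its character class [\w-\.]: with the hyphen between \w and \., Python's re module reads it as a range starting at a character class and rejects the pattern, whereas a hyphen placed last, as in [\w.-], is a literal. A minimal sketch to check this with the standard library only (the helper name compiles and the variable names are illustrative, not from the posts above):

```python
import re

# Pattern from the question: the hyphen sits between \w and \. inside [...]
question_pattern = r"^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$"
# Pattern from the answer: the hyphen is last in the class, so it is a literal
answer_pattern = r"^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$"

def compiles(pattern):
    """Return True if re.compile accepts `pattern`, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(question_pattern))
print(compiles(answer_pattern))
```

If the first call reports False, the error is raised before Beautiful Soup ever sees the pattern, so the string argument itself is not at fault.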
Note that ^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$ is quite restrictive; you might want to use a more generic one like ^\S+@\S+\.\S+$, which matches one or more non-whitespace characters, an @, one or more non-whitespace characters, a ., and again one or more non-whitespace characters.
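To make the difference between the two patterns concrete, here is a small sketch that runs both against a few sample strings. The sample strings are hypothetical and chosen only for illustration; they do not come from the original post.

```python
import re

restrictive = re.compile(r"^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$")
generic = re.compile(r"^\S+@\S+\.\S+$")

# Hypothetical sample strings, not taken from the original post
samples = ["user@example.com", "user@example.museum", "no at sign here"]

# Map each sample to (matches restrictive pattern, matches generic pattern)
results = {s: (bool(restrictive.match(s)), bool(generic.match(s))) for s in samples}

for text, (strict_ok, loose_ok) in results.items():
    print(f"{text!r}: restrictive={strict_ok}, generic={loose_ok}")
```

The second sample is rejected by the restrictive pattern because \w{2,4} caps the final label at four characters, while the generic pattern accepts it. The trade-off is that the generic pattern also accepts many strings that are not valid addresses.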
Notes on the code:
With div.find_all("p", string=re.compile(r"^[\w.-]+@(?:[\w-]+\.)+\w{2,4}$")), you get all child p tags of the current div element whose text matches the regex pattern fully.
print([em.text for em in emails]) prints just the texts of all the found p nodes with only emails in them.
Upvotes: 1