Scrape text not contained in any element

Question

I'm scraping a very poorly written site with Beautiful Soup 4. I've got everything but the user's email address, which isn't in any containing element that distinguishes it. Any ideas how to scrape it? next_sibling of the strong element skips right over it, as I expected.


 
<div>
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
</div>

jedwards · Accepted Answer

I'm not sure this is the best way, but you could get the parent element, then iterate over its children and look at the non-tags:

from bs4 import BeautifulSoup
import bs4

html='''
<div>
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
</div>
'''


def print_if_email(s):
    if '@' in s:
        print(s)

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# Iterate over all divs, you could narrow this down if you had more information
for div in soup.find_all('div'):
    # Iterate over the children of each matching div
    for c in div.children:
        # If it wasn't parsed as a tag, it may be a NavigableString
        if isinstance(c, bs4.element.NavigableString):
            # Some heuristic to identify email addresses if other non-tags exist
            print_if_email(c.strip())

Prints:

useremail@yahoo.com

Of course, the inner for loop and if statement could be combined into:

for c in filter(lambda c: isinstance(c, bs4.element.NavigableString), div.children):
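
For reference, here is a minimal Python 3 sketch of the same idea wrapped in a function, assuming the markup looks like the sample in the question (a div whose bare text follows a strong tag). The names SAMPLE_HTML and bare_text_with_at are made up for illustration, and the '@' check is only a rough heuristic, not real email validation:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample markup, shaped like the snippet in the question.
SAMPLE_HTML = """
<div>
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
</div>
"""


def bare_text_with_at(html):
    """Return stripped text nodes that are direct children of a <div> and contain '@'."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for div in soup.find_all("div"):
        # Direct children are either tags or strings; keep only the strings.
        for child in div.children:
            if isinstance(child, NavigableString) and "@" in child:
                found.append(child.strip())
    return found


if __name__ == "__main__":
    print(bare_text_with_at(SAMPLE_HTML))
```

Returning a list instead of printing makes it easier to narrow the search later, for example by swapping the '@' test for a stricter pattern once you know what other loose text the page contains.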
