Reputation: 6221
I'm scraping a very poorly written site with Beautiful Soup 4. I've got everything but the user's email address, which isn't in any containing element that distinguishes it. Any ideas how to scrape it? The next_sibling of the strong element skips right over it, as I expected.
<div class="fieldset-wrapper">
<strong>
E-mail address:
</strong>
[email protected]
<div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
<div class="field-items">
Upvotes: 0
Views: 608
Reputation: 1295
I can't answer your question directly, as I've never used Beautiful Soup (so do NOT accept this answer!), but if the pages are all pretty simple, an alternative option might be to write your own parser using .split().
This is rather clumsy, but worth considering if pages are simple/predictable...
That is, if you know something about the overall layout of the page (e.g., the user's email is always the first email mentioned), you could write your own parser to find the bits before and after the '@' sign:
# html = the entire document as a string
# return the entire document up to the '@' sign
bit_before_at_sign = html.split('@')[0]
# only useful if you know first email is the one you care about
# you could then cut out everything before username with something like this
b = bit_before_at_sign
# a very long string, we just want the last bit right before the @ sign
username = b.split(' ')[-1].split('\n')[-1].split('\r')[-1].split(';')[-1]
# add more if required, depending on how the html looks to you
# (I've just guessed some html elements that might precede the username)
# you could similarly parse the bit after the @ sign,
# html.split('@')[1]
# e.g., checking the first few characters of this
# against a known list of .tlds like '.com', '.co.uk', etc
# (remember some TLDs have more than one period, so don't just parse by '.')
# and combine with the username you already know
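The split-based idea above can be put together roughly as follows. This is only a sketch: the function name `first_email_by_split` and the two delimiter sets are assumptions made for illustration, and it only looks at the first '@' in the document.

```python
def first_email_by_split(html):
    """Return the first 'user@domain' token found in html, or None."""
    before, sep, after = html.partition('@')
    if not sep:
        return None  # no '@' anywhere in the document

    # Username: whatever follows the last delimiter before the '@'.
    username = before
    for delimiter in ' \n\r\t;>"\'':
        username = username.split(delimiter)[-1]

    # Domain: whatever precedes the first delimiter after the '@'.
    domain = after
    for delimiter in ' \n\r\t;<"\'':
        domain = domain.split(delimiter)[0]

    if not username or not domain:
        return None
    return username + '@' + domain
```

For example, `first_email_by_split('</strong>\njane@example.com\n<div>')` gives `'jane@example.com'`. The result still needs a sanity check, since any stray '@' earlier in the page would be picked up instead.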
Also at your disposal, in case you want to narrow down which bit of the document you focus on:
In case you want to make sure the word 'e-mail' is also in the string you're parsing:
if 'email' in b.lower() or 'e-mail' in b.lower():
    pass  # do something...
To check where in the document the '@' symbol first appears:
html.index('@')  # raises ValueError if there is no '@'; html.find('@') returns -1 instead
# e.g., if you want to see how near this '@' symbol is to some other element you know about
# such as the word 'e-mail', or a particular div element or '</strong>'
To confine your search for an email to the 300 characters after another element you know about (adjust the slice if you want to look before it instead):
startfrom = html.index('</strong>')
html_i_will_search = html[startfrom:startfrom+300]
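A slightly more defensive version of the same windowing idea might look like the sketch below. The helper name `email_near`, its parameters, and the deliberately loose regular expression are all assumptions for illustration; the pattern is not a full email-address grammar.

```python
import re

# Loose pattern (an assumption): word characters, dots, plus and minus signs
# around an '@', with at least one dot in the domain part.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def email_near(html, marker='</strong>', window=300):
    """Return the first email-like token within `window` chars after `marker`."""
    start = html.find(marker)  # find() returns -1 instead of raising
    if start == -1:
        return None
    match = EMAIL_RE.search(html[start:start + window])
    return match.group(0) if match else None
```

Using `find()` rather than `index()` means a page without the marker gives `None` instead of raising an exception, which is convenient when looping over many pages.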
I imagine a few minutes more on Google may alternatively prove useful; your task doesn't sound unusual :)
And make sure you consider cases where there are multiple email addresses on the page (e.g., so you don't assign [email protected] to every user!)
Whatever method you go with, if you have doubts, it might be worth checking your answer using email.utils.parseaddr() or someone else's regex checker. See previous question
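As a rough illustration of that check, the sketch below runs a candidate through `email.utils.parseaddr()` from the standard library. Note that `parseaddr()` is lenient and does not validate much by itself, so the extra tests on the '@' and the dot in the domain are an added assumption, as is the helper name `looks_like_email`.

```python
from email.utils import parseaddr

def looks_like_email(candidate):
    """Very rough plausibility check for a scraped email address."""
    _name, addr = parseaddr(candidate)
    local, sep, domain = addr.partition('@')
    # Require a non-empty local part, an '@', and a dot somewhere in the domain.
    return bool(sep and local and '.' in domain)
```

This only filters out obvious non-addresses; it does not confirm that the address exists or that it belongs to the user being scraped.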
Upvotes: 0
Reputation: 30260
I'm not sure this is the best way, but you could get the parent element, then iterate over its children and look at the non-tags:
import bs4
from bs4 import BeautifulSoup

html = '''
<div class="fieldset-wrapper">
<strong>
E-mail address:
</strong>
[email protected]
<div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
<div class="field-items">
'''

def print_if_email(s):
    # Crude heuristic: anything containing '@' is treated as an email address
    if '@' in s:
        print(s)

# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
# Iterate over all divs, you could narrow this down if you had more information
for div in soup.find_all('div'):
    # Iterate over the children of each matching div
    for c in div.children:
        # If it wasn't parsed as a tag, it may be a NavigableString
        if isinstance(c, bs4.element.NavigableString):
            # Some heuristic to identify email addresses if other non-tags exist
            print_if_email(c.strip())
Prints:
[email protected]
Of course, the inner for loop and if statement could be combined into:
for c in filter(lambda c: isinstance(c, bs4.element.NavigableString), div.children):
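If the label text is reliable, the search could be narrowed to the text nodes that follow the matching strong element. The sketch below assumes the address is a bare text node that is a sibling of the strong tag, as in the snippet from the question; the function name `email_after_label` and its `label` parameter are made up for illustration. It relies on `.next_siblings`, which yields text nodes as well as tags.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def email_after_label(html, label='E-mail address:'):
    """Return the first '@'-containing text node following the labelled <strong>."""
    soup = BeautifulSoup(html, 'html.parser')
    for strong in soup.find_all('strong'):
        if label not in strong.get_text():
            continue
        # Walk the siblings after the label; keep only bare strings.
        for sibling in strong.next_siblings:
            if isinstance(sibling, NavigableString):
                text = sibling.strip()
                if '@' in text:
                    return text
    return None
```

Anchoring on the label avoids picking up unrelated addresses elsewhere on the page, at the cost of breaking if the site changes the label wording.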
Upvotes: 2