Reputation: 554
I have a series of web pages I want to scrape text from, but unfortunately they all follow different patterns. I'm trying to write a scraper that extracts the text after <br> tags, as that structure is common to all pages.
The pages follow three basic patterns as best I can tell:
As I have it now, I'm scraping with the following loop:
    for br in soup.find_all('br'):
        text = br.next_sibling
        try:
            print(text.strip().replace("\t", " ").replace("\r", " ").replace('\n', ' '))
        except AttributeError:
            print('...')
This script works for some pages, but grabs only some or none of the text for others. I've been tearing my hair out over this for the last few days, so any help would be greatly appreciated.
Also, I tried this technique already, but couldn't make it work for all the pages.
Upvotes: 0
Views: 1258
Reputation: 474191
I would still rely on the underline style of the span elements. Here is some sample code that should help you get started (using .next_siblings):
    from bs4 import Tag

    # assumes `soup` is an existing BeautifulSoup object
    for span in soup.select('p > span[style*=underline]'):
        texts = []
        for sibling in span.next_siblings:
            # break upon reaching the next span
            if sibling.name == "span":
                break

            # tags expose get_text(); plain strings only need stripping
            text = sibling.get_text(strip=True) if isinstance(sibling, Tag) else sibling.strip()
            if text:
                texts.append(text.replace("\n", " "))

        if texts:
            text = " ".join(texts)
            print(span.text.strip(), text.strip())
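To try this without a live page, a self-contained sketch of the same loop might look like the following. The sample markup and the function name extract_labelled_text are made up for illustration, and I'm assuming the built-in html.parser; your real pages may well need adjustments to the selector.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup; the real pages may be structured differently.
SAMPLE_HTML = """
<p>
  <span style="text-decoration: underline">Name:</span> Jane <b>Doe</b><br>
  <span style="text-decoration: underline">City:</span>
  Springfield
</p>
"""


def extract_labelled_text(html):
    """Return (label, text) pairs for each underlined span directly inside a <p>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for span in soup.select("p > span[style*=underline]"):
        texts = []
        for sibling in span.next_siblings:
            # Stop at the next span, which starts the next label.
            if isinstance(sibling, Tag) and sibling.name == "span":
                break
            if isinstance(sibling, Tag):
                # Nested tags such as <b>; <br> yields an empty string.
                text = sibling.get_text(strip=True)
            else:
                # Plain text nodes between tags.
                text = str(sibling).strip()
            if text:
                # Collapse newlines, tabs and repeated spaces into single spaces.
                texts.append(" ".join(text.split()))
        if texts:
            pairs.append((span.get_text(strip=True), " ".join(texts)))
    return pairs


if __name__ == "__main__":
    for label, text in extract_labelled_text(SAMPLE_HTML):
        print(label, text)
```

Returning the pairs instead of printing inside the loop makes it easier to check each page pattern separately and see which ones the selector misses.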
Upvotes: 1