ModalBro

Reputation: 554

Extracting texts after <br> with BeautifulSoup

I have a series of web pages I want to scrape text from, but unfortunately they all follow different patterns. I'm trying to write a scraper that extracts the text after <br> tags, since that structure is common to all of the pages.

The pages follow three basic patterns as best I can tell:

  1. http://www.p2016.org/ads1/bushad120215.html
  2. http://www.p2016.org/ads1/christiead100515.html
  3. http://www.p2016.org/ads1/patakiad041615.html

As I have it now, I'm scraping with the following loop:

# soup is a BeautifulSoup object built from the page's HTML
for br in soup.find_all('br'):
    text = br.next_sibling

    try:
        print(text.strip().replace("\t", " ").replace("\r", " ").replace("\n", " "))
    except AttributeError:
        print('...')

This script works for some pages, but it grabs only some or none of the text on others. I've been tearing my hair out over this for the last few days, so any help would be greatly appreciated.

Also, I tried this technique already, but couldn't make it work for all the pages.
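
For reference, the failure can be reproduced on a small made-up fragment (hypothetical markup, not copied from the pages above). This sketch assumes Python 3 and the built-in html.parser. Since .next_sibling returns only the single node immediately after each <br>, text wrapped in another tag, or a <br> with nothing after it, yields no string:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment for illustration only.
html = "<p>intro<br>plain text<br><i>wrapped text</i><br></p>"
soup = BeautifulSoup(html, "html.parser")

results = []
for br in soup.find_all("br"):
    node = br.next_sibling
    if isinstance(node, NavigableString):
        # Bare text directly after the <br>: this case works.
        results.append(node.strip())
    else:
        # node is a Tag (text wrapped in markup) or None (nothing follows).
        results.append(None)

print(results)
```

Only the first <br> is followed directly by a bare string; the other two produce nothing usable.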

Upvotes: 0

Views: 1258

Answers (1)

alecxe

Reputation: 474191

I would still rely on the underline style of the span elements. Here is some sample code that should help you get started (using .next_siblings):

from bs4 import Tag

# soup is a BeautifulSoup object built from the page's HTML
for span in soup.select('p > span[style*=underline]'):
    texts = []
    for sibling in span.next_siblings:
        # break upon reaching the next span
        if isinstance(sibling, Tag) and sibling.name == "span":
            break

        # tags contribute their inner text, bare strings are used directly
        text = sibling.get_text(strip=True) if isinstance(sibling, Tag) else sibling.strip()
        if text:
            texts.append(text.replace("\n", " "))

    if texts:
        text = " ".join(texts)
        print(span.text.strip(), text.strip())
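
To try the idea without fetching a page, here is a minimal self-contained sketch that runs on a made-up fragment. The markup, the speaker labels and the helper name extract_pairs are assumptions for illustration, not taken from the actual pages. It returns (label, text) pairs instead of printing them, and collapses runs of whitespace inside each piece:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup, loosely modeled on "underlined span followed by text".
html = """
<p>
<span style="text-decoration: underline;">Narrator:</span> First line<br>
second line
<span style="text-decoration: underline;">Speaker:</span> <i>Reply</i> text
</p>
"""

def extract_pairs(markup):
    """Return (label, text) pairs, one per underlined span (illustrative helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for span in soup.select('p > span[style*="underline"]'):
        texts = []
        for sibling in span.next_siblings:
            # Stop when the next label starts.
            if isinstance(sibling, Tag) and sibling.name == "span":
                break
            # Tags (including <br>) contribute their inner text, strings are used as-is.
            text = sibling.get_text(strip=True) if isinstance(sibling, Tag) else sibling.strip()
            if text:
                # Collapse internal newlines, tabs and repeated spaces.
                texts.append(" ".join(text.split()))
        if texts:
            pairs.append((span.get_text(strip=True), " ".join(texts)))
    return pairs

for label, text in extract_pairs(html):
    print(label, text)
```

Because the loop walks every sibling rather than only the first one after a <br>, text that is wrapped in inline tags or split across several <br> elements is still collected.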

Upvotes: 1
