Ryn

Reputation: 15

Extracting Messy, Untagged HTML text using Beautiful Soup in Python

I am trying to parse a webpage that contains a lot of untagged text using BeautifulSoup. As seen in the example below, the pattern is a name in <b> tags, followed by several lines of untagged text interleaved with line breaks. At the end of each "group" of text there is an <hr> tag that marks the beginning of the next section.

I would like to put this information in a CSV file for the time being. My current plan is to use soup.find_all("b") to get all of the names. For each name retrieved I would manually cycle through its siblings using something like next_sibling, adding the lines of text to my CSV file and ignoring the line breaks. After reaching an <hr> element, I would move to the next "name" from the soup.find_all("b") results and advance the CSV to the next row.

I am not sure if this line of thinking will actually translate to success. For one, I haven't yet figured out how to select each line of untagged text. The various examples I have been able to find involve selecting all untagged text on a page simultaneously, which doesn't do me much good. The other issue is that I am not sure if my suggested method of "navigating" the page contents is logically correct. Trying to get the next_sibling of an element returned by soup.find_all("b") returns None in the experiments I've done. I haven't figured that one out yet either.
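For what it's worth, the sibling-walking idea described above can be sketched roughly as follows. This is only an illustration: the `html` string is a made-up, shortened sample in the same shape as the page, not the real markup. Each untagged line is a text node (a NavigableString) that sits next to the <br> tags as a sibling, and whitespace between tags counts as a text node too, so empty strings need to be filtered out.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened sample in the same shape as the page in the question
html = "<div><b>Thing 1</b><br>Text About Thing 1<br>More Text About Thing 1<br><hr></div>"
soup = BeautifulSoup(html, "html.parser")

first_b = soup.find("b")
lines = []
for node in first_b.next_siblings:
    # Tags have a name; text nodes do not
    if getattr(node, "name", None) == "hr":
        break  # <hr> marks the end of this group
    # Keep text nodes only, skipping whitespace-only ones
    if isinstance(node, NavigableString) and node.strip():
        lines.append(node.strip())

print(lines)
```

Iterating over next_siblings avoids calling next_sibling repeatedly by hand, and the loop can be repeated for every <b> element found.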

I admittedly don't have much experience with Beautiful Soup and it has been a minute since I have worked with HTML in general. Looking forward to learning more about this!

<div class="maincontent">
    <b>Thing 1</b>
    <br>
    Text About Thing 1
    <br>
    More Text About Thing 1
    <br>
    Even More Text About Thing 1
    <br>
    Even MORE Text About Thing 1
    <br>
    <hr>
    <b>Thing 2</b>
    <br>
    Text About Thing 2
    <br>
    More Text About Thing 2
    <br>
    Even More Text About Thing 2
    <br>
    Even MORE Text About Thing 2
    <br>
    <hr>
    <b>Thing 3</b>
    <br>
    Text About Thing 3
    <br>
    More Text About Thing 3
    <br>
    Even More Text About Thing 3
    <br>
    Even MORE Text About Thing 3
    <br>
    <hr>
</div>

Edit: The desired output would look like:

Thing 1,Text About Thing 1,More Text About Thing 1,Even More Text About Thing 1,Even MORE Text About Thing 1
Thing 2,Text About Thing 2,More Text About Thing 2,Even More Text About Thing 2,Even MORE Text About Thing 2
Thing 3,Text About Thing 3,More Text About Thing 3,Even More Text About Thing 3,Even MORE Text About Thing 3

In addition, there is a condition I neglected to include in the example. Some of the "Thing" sections actually look like this:

<div class="maincontent">
    ...
    <b>Thing 4</b>
    <br>
    Text About Thing 4
    <br>
     Text about 
     <a href="www.example.com">
       Thing 4
     </a>
     with a link in the middle.
    <br>
    Even More Text About Thing 4
    <br>
    Even MORE Text About Thing 4
    <br>
    <hr>
    ...
</div>

Ideally the text surrounding the link would be joined into a single field, producing the following:

Thing 4,Text About Thing 4,Text about Thing 4 with a link in the middle,Even More Text About Thing 4,Even MORE Text About Thing 4

Instead of that, my output currently looks like this using the method recommended by HedgeHog.

Thing 4,Text About Thing 4,Text about,Thing 4,with a link in the middle,Even More Text About Thing 4,Even MORE Text About Thing 4
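The split happens because the <a> tag breaks the sentence into three separate nodes: the text before it, the link itself, and the text after it. One possible way around this, sketched here on a made-up fragment and assuming Beautiful Soup 4.8.0 or newer (where smooth() is available), is to unwrap the links and then merge the adjacent text nodes:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the "Thing 4" section
html = (
    '<div><b>Thing 4</b><br>Text about <a href="www.example.com">Thing 4</a>'
    ' with a link in the middle.<br>Even More Text About Thing 4<br><hr></div>'
)
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):
    a.unwrap()  # keep the link text, drop the <a> tag itself
soup.smooth()   # merge the now-adjacent text nodes into one

# Collect the non-empty text nodes that follow the <b> tag
texts = [
    s.strip()
    for s in soup.find("b").next_siblings
    if not getattr(s, "name", None) and s.strip()
]
print(texts)
```

After smooth() the sentence is a single text node again, so a loop that collects one field per text node no longer needs special handling for links.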

Edit 2:

Here is my current solution based heavily on what HedgeHog posted below.

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
maincontent = soup.select_one(".maincontent")

# newline='' lets the csv module manage line endings itself
with open('myfile.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')

    # Replace each <a> with its contents so the link text sits next to the surrounding text
    for a in maincontent.find_all('a'):
        a.unwrap()

    for b in maincontent.select('b'):
        d = [b.text]
        isNewElement = True  # True when the next text node starts a new field
        for t in b.next_siblings:
            if t.name == 'b':
                break
            if t.name:
                # Any other tag (<br>, <hr>) ends the current field
                isNewElement = True
            elif t.strip() != '':
                if isNewElement:
                    d.append(t.strip())
                    isNewElement = False
                else:
                    # Continuation of a line that an unwrapped <a> split apart
                    d[-1] = d[-1] + t
        writer.writerow(d)

The only remaining issue is making sure the proper whitespace remains before and after each URL. Everything else I need to do involves reading each string and parsing out certain information, so I should be good from here. Thank you all!
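A minimal sketch of one way to tidy that whitespace, assuming it is acceptable to collapse every run of spaces and newlines into a single space. The helper name normalize_whitespace and the sample string are made up for illustration:

```python
def normalize_whitespace(text: str) -> str:
    """Collapse any run of spaces, tabs, or newlines into a single space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Hypothetical field as it might look after concatenating the pieces around a link
raw = "Text about \n     \n       Thing 4\n     \n     with a link in the middle.\n"
print(normalize_whitespace(raw))
```

Applied to each field just before writing, for example writer.writerow([normalize_whitespace(x) for x in d]), this would leave exactly one space on either side of the link text.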

Upvotes: 0

Views: 401

Answers (2)

Andrej Kesely

Reputation: 195543

Another version: you can replace every <hr> in the main section with a separator of your choice and then use itertools.groupby to get the separate blocks of text, for example:

from itertools import groupby
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser") # <-- html_doc is your HTML from the question

maincontent = soup.select_one(".maincontent")
for hr in maincontent.select("hr"):
    hr.replace_with("-" * 80)

text = maincontent.get_text(strip=True, separator="\n")

for is_separator, g in groupby(text.splitlines(), lambda k: k == "-" * 80):
    if not is_separator:
        print(" ".join(g))  # <-- or store it to file instead printing to screen

Prints:

Thing 1 Text About Thing 1 More Text About Thing 1 Even More Text About Thing 1 Even MORE Text About Thing 1
Thing 2 Text About Thing 2 More Text About Thing 2 Even More Text About Thing 2 Even MORE Text About Thing 2
Thing 3 Text About Thing 3 More Text About Thing 3 Even More Text About Thing 3 Even MORE Text About Thing 3

Or just use normal str.split:

soup = BeautifulSoup(html_doc, "html.parser")

maincontent = soup.select_one(".maincontent")
for hr in maincontent.select("hr"):
    hr.replace_with("-" * 80)

text = maincontent.get_text(strip=True, separator="\n")

for group in map(str.strip, text.split("-" * 80)):
    if group:
        print(group)
        print()

Prints 3 blocks:

Thing 1
Text About Thing 1
More Text About Thing 1
Even More Text About Thing 1
Even MORE Text About Thing 1

Thing 2
Text About Thing 2
More Text About Thing 2
Even More Text About Thing 2
Even MORE Text About Thing 2

Thing 3
Text About Thing 3
More Text About Thing 3
Even More Text About Thing 3
Even MORE Text About Thing 3
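Since the question asks for CSV output, the str.split variant could be adapted along these lines. This is a sketch only: html_doc here is a shortened, made-up sample, and io.StringIO stands in for a real file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, shortened sample; use the real page markup instead
html_doc = """<div class="maincontent">
<b>Thing 1</b><br>Text About Thing 1<br>More Text About Thing 1<br><hr>
<b>Thing 2</b><br>Text About Thing 2<br>More Text About Thing 2<br><hr>
</div>"""

SEPARATOR = "-" * 80
soup = BeautifulSoup(html_doc, "html.parser")
maincontent = soup.select_one(".maincontent")
for hr in maincontent.select("hr"):
    hr.replace_with(SEPARATOR)

text = maincontent.get_text(strip=True, separator="\n")

buffer = io.StringIO()  # stand-in for open('myfile.csv', 'w', newline='')
writer = csv.writer(buffer)
rows = []
for group in text.split(SEPARATOR):
    # One CSV field per non-empty line of the group
    fields = [line.strip() for line in group.splitlines() if line.strip()]
    if fields:
        rows.append(fields)
        writer.writerow(fields)

print(buffer.getvalue())
```

One caveat: get_text with a separator places the text of nested tags such as <a> on its own line, so a sentence containing a link would still end up spread over several fields with this approach.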

Upvotes: 1

HedgeHog

Reputation: 25196

The approach you describe sounds reasonable, and from my point of view you have almost reached your goal. Because the expected output is not clear from the question, this example just points in a direction:

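from bs4 import BeautifulSoup
import csv

# html holds the markup from the question
soup = BeautifulSoup(html, 'html.parser')
# newline='' lets the csv module manage line endings itself
with open('myfile.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')

    for b in soup.select('b'):
        d = [b.text]
        # Walk the siblings after each <b> until the next <b> starts
        for t in b.next_siblings:
            if t.name == 'b':
                break
            # Keep only non-empty text nodes (tags such as <br> have a name)
            if not t.name and t.strip() != '':
                d.append(t.strip())
        writer.writerow(d)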

Output

Thing 1,Text About Thing 1,More Text About Thing 1,Even More Text About Thing 1,Even MORE Text About Thing 1
Thing 2,Text About Thing 2,More Text About Thing 2,Even More Text About Thing 2,Even MORE Text About Thing 2
Thing 3,Text About Thing 3,More Text About Thing 3,Even More Text About Thing 3,Even MORE Text About Thing 3

Upvotes: 0
