Mr369

Reputation: 394

Extract text blocks between <p> tags separated by <br>

I want to parse all the text blocks (TITLE CONTENT, BODY CONTENT, and EXTRA CONTENT) from the sample below. As you might notice, these text blocks are positioned differently inside each 'p' tag.

<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>

I want to have my final result in table format like:

       Col1             Col2               Col3
TITLE CONTENT #1     BODY CONTENT #1     EXTRA CONTENT #1
TITLE CONTENT #2     BODY CONTENT #2     EXTRA CONTENT #2
TITLE CONTENT #3     BODY CONTENT #3     EXTRA CONTENT #3

I've tried

 for i in soup.find_all('p'):
     title = i.find('strong')
     if not isinstance(title.nextSibling, NavigableString):
         body= title.nextSibling.nextSibling
         extra= body.nextSibling.nextSibling
     else:
         if len(title.nextSibling) > 3:
             body= title.nextSibling
             extra= body.nextSibling.nextSibling
         else:
             body= title.nextSibling.nextSibling.nextSibling
             extra= body.nextSibling.nextSibling

But it doesn't look efficient. I'm wondering if anyone has any better solutions?
Any help will be really appreciated!

Thanks!

Upvotes: 0

Views: 872

Answers (3)

Andrej Kesely

Reputation: 195543

In this case you can use BeautifulSoup's get_text() method with the separator= parameter:

data = '''<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>'''


from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
for p in soup.select('p.plans'):
    # Split the paragraph's text on the separator and drop whitespace-only pieces
    row = [s.strip() for s in p.get_text(separator='|').split('|') if s.strip()]
    print(''.join('{: ^25}'.format(s) for s in row))

Prints:

      Col1                     Col2                     Col3           
TITLE CONTENT #1          BODY CONTENT #1         EXTRA CONTENT #1     
TITLE CONTENT #2          BODY CONTENT #2         EXTRA CONTENT #2     
TITLE CONTENT #3          BODY CONTENT #3         EXTRA CONTENT #3     

Upvotes: 0

Willian Vieira

Reputation: 746

Another way is to use slicing, assuming every <p> always contains exactly three text blocks:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("test.html"), "html.parser")

def slicing(l):
    # Split the flat list of strings into chunks of three (title, body, extra)
    new_list = []
    for i in range(0, len(l), 3):
        new_list.append(l[i:i+3])
    return new_list

result = slicing(list(soup.stripped_strings))
print(result)

Output:

[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
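If the number of text blocks per paragraph might vary, a possible variant is to collect stripped_strings per <p> rather than for the whole document, so no fixed chunk size is needed. This is only a sketch: the html string below is a shortened stand-in for the question's sample, and the name rows is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption: two plans
# are enough to show both <br/> placements).
html = """
<p class="plans"><strong>TITLE CONTENT #1</strong><br/>BODY CONTENT #1<br/>EXTRA CONTENT #1</p>
<p class="plans"><strong>TITLE CONTENT #2<br/></strong>BODY CONTENT #2<br/>EXTRA CONTENT #2</p>
"""

soup = BeautifulSoup(html, "html.parser")

# One list of non-empty, whitespace-stripped strings per paragraph
rows = [list(p.stripped_strings) for p in soup.select("p.plans")]
print(rows)
```

Each inner list then holds whatever text blocks that paragraph contains, whether or not there are exactly three.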

Upvotes: 0

facelessuser

Reputation: 1734

It is important to note that .next_sibling could work as well, but you'd have to use some logic to know how many times to call it, as you may need to gather multiple text nodes. In this example, I find it easier to simply walk the descendants, noting the important characteristics that signal when to do something different.

You simply have to break down the characteristics of what you are scraping. In this simple case, we know:

  1. When we see the strong element, we want to capture the "title".
  2. When we see the first br element, we want to start capturing the "content".
  3. When we see the second br element, we want to start capturing the "extra content".

We can:

  1. Target the plans class to get all the plans.
  2. Then we can iterate through all the descendant nodes of each plan.
  3. If we see a tag, check whether it matches one of the conditions above and prepare to capture text nodes in the correct container.
  4. If we see a text node, and we have a container ready, store the text.
  5. Strip unnecessary leading and trailing white space and store the data for the plan.

from bs4 import BeautifulSoup as bs
from bs4 import Tag, NavigableString

html = """
<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>
"""

soup = bs(html, 'html.parser')

content = []

# Iterate through all the plans
for plans in soup.select('.plans'):
    # Lists that will hold the text nodes of interest
    title = []
    body = []
    extra = []

    current = None  # Reference to one of the above lists to store data
    br = 0  # Count number of br tags

    # Iterate through all the descendant nodes of a plan
    for node in plans.descendants:
        # See if the node is a Tag/Element
        if isinstance(node, Tag):
            if node.name == 'strong':
                # Strong tags/elements contain our title
                # So set the current container for text to the title list
                current = title
            elif node.name == 'br':
                # We've found a br Tag/Element
                br += 1
                if br == 1:
                    # If this is the first, we need to set the current
                    # container for text to the body list
                    current = body
                elif br == 2:
                    # If this is the second, we need to set the current
                    # container for text to the extra list
                    current = extra
        elif isinstance(node, NavigableString) and current is not None:
            # We've found a navigable string (not a tag/element), so let's
            # store the text node in the current list container.
            # NOTE: You may have to filter out things like HTML comments in a real world example.
            current.append(node)

    # Store the captured title, body, and extra text for the current plan.
    # For each list, join the text into one string and strip leading and trailing whitespace
    # from each entry in the row.
    content.append([''.join(entry).strip() for entry in (title, body, extra)])

print(content)

Then you can print the data any way you want, but you should have it captured in a nice logical way as shown below:

[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
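As one possible way to turn that list into the Col1/Col2/Col3 table from the question, here is a minimal sketch using only the standard library. The content list is copied from the output above; header, widths, format_row, and lines are hypothetical names chosen for illustration.

```python
# Captured rows, copied from the output above
content = [
    ['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'],
    ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'],
    ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3'],
]
header = ['Col1', 'Col2', 'Col3']

# Each column is as wide as its longest cell (header included)
widths = [max(len(row[i]) for row in [header] + content) for i in range(len(header))]

def format_row(row):
    # Left-align each cell to its column width, two spaces between columns
    return '  '.join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()

lines = [format_row(header)] + [format_row(row) for row in content]
print('\n'.join(lines))
```

Computing the widths from the data keeps the columns aligned even if the real text blocks are longer than the sample ones.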

There are multiple ways to do this; this is just one.
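Regarding the NOTE in the code about HTML comments: in BeautifulSoup, Comment is a subclass of NavigableString, so comments show up when walking .descendants. A rough sketch of filtering them out is below; the one-line html sample and the name texts are assumptions made up for this illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical sample with a comment inside the title
html = '<p class="plans"><strong>TITLE<!-- internal note --></strong><br/>BODY<br/>EXTRA</p>'

soup = BeautifulSoup(html, 'html.parser')
plan = soup.select_one('.plans')

# Keep plain text nodes only; skip Comment nodes explicitly
texts = [
    str(node).strip()
    for node in plan.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(texts)
```

The same `not isinstance(node, Comment)` check could be added to the `elif` branch of the loop above before appending to the current container.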

Upvotes: 1
