trombonebraveheart

Reputation: 81

Extract text blocks between tags separated by <br> without a tag title

I have a web page that has a series of tags with a specific class within the page. The tags I'm interested in look like this:

<span class="my-span-class">
  "Text of interest before break"
  <br>
  "Text of interest after break"    
</span>

These elements have no title; they are just tags filled with text, each broken up by a single <br> tag. I want my end result to have "Text of interest before break" in a separate list from "Text of interest after break", like this:

my_list_1 [Text of interest before break #1, Text of interest before break #2, Text of interest before break #3, etc...]

my_list_2 [Text of interest after break #1, Text of interest after break #2, Text of interest after break #3, etc...]

However, I'm struggling to get from what's below to two separate lists. It currently outputs the two strings joined together, like so: "Text of interest before breakText of interest after break"

from bs4 import BeautifulSoup

# read the saved page from a local file
with open("html.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# get the tag type that looks like the element shown above
myText = soup.find_all("span", class_="my-span-class")

results = []

for i in myText:
    results.append(i.text.strip())

I want to initialize a second list (e.g. results_2 = []) that stores "Text of interest after break", with the first results list reserved for "Text of interest before break" only.

Upvotes: 1

Views: 401

Answers (4)

seagulf

Reputation: 378

You can try htql:

import htql

page="""
<span class="my-span-class">
  Text of interest before break #1
  <br> 
  Text of interest after break #1
</span>
<span class="my-span-class">
  Text of interest before break #2
  <br> 
  Text of interest after break #2
</span>
"""

results1 = htql.query(page, "<span (class='my-span-class')>.<br>1:px &trim ")

results2 = htql.query(page, "<span (class='my-span-class')>.<br>1:fx &trim ")

It produces:

>>> results1
[('Text of interest before break #1',), ('Text of interest before break #2',)]
>>> results2
[('Text of interest after break #1',), ('Text of interest after break #2',)]

Upvotes: 1

user15398259

Reputation:

You could also use .stripped_strings in combination with zip(*iterable) to unpack them separately.

myTexts = (tag.stripped_strings for tag in soup.find_all("span", class_="my-span-class"))
before, after = zip(*myTexts)

>>> before
('Text of interest before break', 'Text of interest before break 1', 'Text of interest before break 2')

>>> after
('Text of interest after break', 'Text of interest after break 1', 'Text of interest after break 2')
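
For reference, here is a self-contained sketch of the same idea that can be run on its own. The sample markup is assumed for illustration and simply mirrors the structure in the question. Note that the unpacking relies on every span yielding exactly two text nodes: if any span yields only one, `before, after = ...` raises a ValueError, and any strings beyond the second are silently dropped by zip.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration; it mirrors the question's structure.
html = """
<span class="my-span-class">
  Text of interest before break 1
  <br>
  Text of interest after break 1
</span>
<span class="my-span-class">
  Text of interest before break 2
  <br>
  Text of interest after break 2
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Each span yields its whitespace-stripped text nodes in document order.
pairs = [
    tuple(tag.stripped_strings)
    for tag in soup.find_all("span", class_="my-span-class")
]

# zip(*pairs) transposes [(b1, a1), (b2, a2)] into (b1, b2) and (a1, a2).
before, after = zip(*pairs)

print(before)
print(after)
```

If lists are preferred over tuples, wrapping each result in list() should be enough.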

Upvotes: 1

KunduK

Reputation: 33384

Based on your HTML, you can use .contents to get the values from the tag.

contents[0] will return first string

contents[-1] will return last string

from bs4 import BeautifulSoup
html='''<span class="my-span-class">
  Text of interest before break
  <br>
  Text of interest after break   
</span>
<span class="my-span-class">
  Text of interest before break 1
  <br>
  Text of interest after break 1   
</span>
<span class="my-span-class">
  Text of interest before break 2
  <br>
  Text of interest after break 2    
</span>
'''
soup = BeautifulSoup(html, 'html.parser')
Beforelist=[]
Afterlist=[]
for item in soup.find_all("span", class_="my-span-class"):
    Beforelist.append(item.contents[0].strip())
    Afterlist.append(item.contents[-1].strip())
    
print(Beforelist)
print(Afterlist)

Output:

['Text of interest before break', 'Text of interest before break 1', 'Text of interest before break 2']
['Text of interest after break', 'Text of interest after break 1', 'Text of interest after break 2']
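
Indexing .contents works when each span starts and ends with the text of interest. As a possible variation (a sketch, not part of the approach above), you could anchor on the <br> itself and read its neighbouring siblings. The helper name text_of is hypothetical, and the sample markup is assumed for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup assumed for illustration; it mirrors the question's structure.
html = """
<span class="my-span-class">
  Text of interest before break
  <br>
  Text of interest after break
</span>
<span class="my-span-class">
  Text of interest before break 1
  <br>
  Text of interest after break 1
</span>
"""

soup = BeautifulSoup(html, "html.parser")


def text_of(node):
    # Hypothetical helper: return stripped text only when the sibling is a
    # plain text node; tags and missing siblings become an empty string.
    return node.strip() if isinstance(node, NavigableString) else ""


before_list, after_list = [], []
for item in soup.find_all("span", class_="my-span-class"):
    br = item.find("br")
    if br is None:
        # Skip spans that have no <br> at all.
        continue
    before_list.append(text_of(br.previous_sibling))
    after_list.append(text_of(br.next_sibling))

print(before_list)
print(after_list)
```

This assumes the text sits directly next to the <br>; if other tags can appear in between, a grouping approach is likely a better fit.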

Upvotes: 1

abdusco

Reputation: 11101

You can use itertools.groupby to group nodes before and after <br>.

I've gone ahead and made it a bit more robust by handling non-text elements before and after <br>.

from bs4 import BeautifulSoup, Tag
import itertools

soup = BeautifulSoup('''
<span class="my-span-class">
  before break 1
  <span>before break 1.1</span>
  <br>
  after break 1
</span>

<span class="my-span-class">
  before break 2
  <br>
  after break 2
  <span>after break 2.1</span>
</span>

''', 'html.parser')


befores, afters = [], []
for it in soup.select('.my-span-class'):
    # this will give you three groups
    groups = [list(g) for _, g in itertools.groupby(it.children, lambda c: c.name != 'br')]
    # we just need items before br and after br
    before, after = [g for g in groups if g[0].name != 'br']
    
    befores.extend(before)
    afters.extend(after)
             
print(befores)
print(afters)

which prints:

['\n  before break 1\n  ', <span>before break 1.1</span>, '\n', '\n  before break 2\n  ']
['\n  after break 1\n', '\n  after break 2\n  ', <span>after break 2.1</span>, '\n']

This should be enough to demonstrate how you can partition children under an element.

The only thing left to do is to loop over befores and afters and clean up each item.
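
One way that cleanup step might look, as a minimal sketch: the helper name clean is hypothetical, and the choice to drop whitespace-only nodes and flatten nested tags to their text is an assumption about what "clean" should mean here.

```python
from bs4 import BeautifulSoup, Tag
import itertools

soup = BeautifulSoup('''
<span class="my-span-class">
  before break 1
  <span>before break 1.1</span>
  <br>
  after break 1
</span>
<span class="my-span-class">
  before break 2
  <br>
  after break 2
  <span>after break 2.1</span>
</span>
''', 'html.parser')


def clean(nodes):
    # Hypothetical helper: take the text of tags and text nodes alike,
    # strip surrounding whitespace, and drop nodes that end up empty.
    texts = (n.get_text() if isinstance(n, Tag) else str(n) for n in nodes)
    return [t.strip() for t in texts if t.strip()]


befores, afters = [], []
for it in soup.select('.my-span-class'):
    # Same partitioning as above: consecutive children grouped by "is not <br>".
    groups = [list(g) for _, g in itertools.groupby(it.children, lambda c: c.name != 'br')]
    before, after = [g for g in groups if g[0].name != 'br']
    befores.extend(clean(before))
    afters.extend(clean(after))

print(befores)
print(afters)
```

This flattens everything into two lists of strings. If you would rather have one entry per outer span, joining the cleaned pieces of each group before appending is a reasonable alternative.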

Upvotes: 1
