Reputation: 81
I have a web page that has a series of tags with a specific class within the page. The tags I'm interested in look like this:
<span class="my-span-class">
"Text of interest before break"
<br>
"Text of interest after break"
</span>
These elements have no title; they are just tags filled with text, each broken up by a single <br> tag. I want my end result to have "Text of interest before break" in a separate list from "Text of interest after break", like this:
my_list_1 [Text of interest before break #1, Text of interest before break #2, Text of interest before break #3, etc...]
my_list_2 [Text of interest after break #1, Text of interest after break #2, Text of interest after break #3, etc...]
However, I'm struggling to get from what's below to two separate lists. It currently outputs the two strings joined together, like so: "Text of interest before breakText of interest after break"
from bs4 import BeautifulSoup
import urllib.request
f = urllib.request.urlopen("html.html")
soup = BeautifulSoup(f, "html.parser")
# get the tag type that looks like the element shown above
myText = soup.find_all("span", class_="my-span-class")
results = []
for i in myText:
    # .text concatenates every string inside the span, so both parts land in one entry
    results.append(i.text.strip())
I want a second list (i.e. results_2 = []) to hold "Text of interest after break", with the first results list reserved for "Text of interest before break".
Upvotes: 1
Views: 401
Reputation: 378
You can try htql:
import htql
page="""
<span class="my-span-class">
Text of interest before break #1
<br>
Text of interest after break #1
</span>
<span class="my-span-class">
Text of interest before break #2
<br>
Text of interest after break #2
</span>
"""
results1 = htql.query(page, "<span (class='my-span-class')>.<br>1:px &trim ")
results2 = htql.query(page, "<span (class='my-span-class')>.<br>1:fx &trim ")
It produces:
>>> results1
[('Text of interest before break #1',), ('Text of interest before break #2',)]
>>> results2
[('Text of interest after break #1',), ('Text of interest after break #2',)]
Upvotes: 1
Reputation:
You could also use .stripped_strings in combination with zip(*iterable) to unpack them separately.
myTexts = (tag.stripped_strings for tag in soup.find_all("span", class_="my-span-class"))
before, after = zip(*myTexts)
>>> before
('Text of interest before break', 'Text of interest before break 1', 'Text of interest before break 2')
>>> after
('Text of interest after break', 'Text of interest after break 1', 'Text of interest after break 2')
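For reference, a self-contained version of the snippet above might look like the sketch below. The sample HTML and variable names are made up for illustration, and it assumes every span contains exactly two text fragments. Since zip stops at the shortest input, a span with a different number of strings would either truncate the result or make the two-name unpacking fail.

```python
from bs4 import BeautifulSoup

# Sample markup, made up for illustration
html = """
<span class="my-span-class">Text before 1<br>Text after 1</span>
<span class="my-span-class">Text before 2<br>Text after 2</span>
"""
soup = BeautifulSoup(html, "html.parser")

# One generator of whitespace-stripped strings per matching span
texts = (tag.stripped_strings for tag in soup.find_all("span", class_="my-span-class"))

# Transpose: the first string of each span goes to `before`, the second to `after`
before, after = (list(column) for column in zip(*texts))

print(before)
print(after)
```

Converting each column with list() gives lists rather than the tuples that zip produces, which matches what the question asked for.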
Upvotes: 1
Reputation: 33384
Based on your HTML, you can use contents to get the values from the tag. contents[0] will return the first string and contents[-1] will return the last string.
from bs4 import BeautifulSoup
html='''<span class="my-span-class">
Text of interest before break
<br>
Text of interest after break
</span>
<span class="my-span-class">
Text of interest before break 1
<br>
Text of interest after break 1
</span>
<span class="my-span-class">
Text of interest before break 2
<br>
Text of interest after break 2
</span>
'''
soup = BeautifulSoup(html, 'html.parser')
Beforelist=[]
Afterlist=[]
for item in soup.find_all("span", class_="my-span-class"):
    Beforelist.append(item.contents[0].strip())
    Afterlist.append(item.contents[-1].strip())
print(Beforelist)
print(Afterlist)
Output:
['Text of interest before break', 'Text of interest before break 1', 'Text of interest before break 2']
['Text of interest after break', 'Text of interest after break 1', 'Text of interest after break 2']
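Indexing contents relies on the text being the first and last child of each span. A possible variation, not part of the answer above, is to locate the <br> itself and read its neighbours. This is only a sketch: it assumes each span has at most one <br> with plain text directly on either side, and the names before_list and after_list are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup, made up for illustration
html = """
<span class="my-span-class">
Text before 1
<br>
Text after 1
</span>
<span class="my-span-class">
Text before 2
<br>
Text after 2
</span>
"""
soup = BeautifulSoup(html, "html.parser")

before_list, after_list = [], []
for span in soup.find_all("span", class_="my-span-class"):
    br = span.find("br")
    if br is None:
        # Skip spans that have no line break at all
        continue
    # Siblings are the nodes directly next to <br>; a missing one becomes ""
    before_list.append(str(br.previous_sibling or "").strip())
    after_list.append(str(br.next_sibling or "").strip())

print(before_list)
print(after_list)
```

If a sibling happened to be a tag rather than a string, str() would return its markup, so this version is only suitable when the neighbours are plain text.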
Upvotes: 1
Reputation: 11101
You can use itertools.groupby to group nodes before and after <br>.
I've gone ahead and made it a bit more robust by handling non-text elements before and after <br>.
from bs4 import BeautifulSoup, Tag
import itertools
soup = BeautifulSoup('''
<span class="my-span-class">
before break 1
<span>before break 1.1</span>
<br>
after break 1
</span>
<span class="my-span-class">
before break 2
<br>
after break 2
<span>after break 2.1</span>
</span>
''', 'html.parser')
befores, afters = [], []
for it in soup.select('.my-span-class'):
    # this will give you three groups
    groups = [list(g) for _, g in itertools.groupby(it.children, lambda c: c.name != 'br')]
    # we just need items before br and after br
    before, after = [g for g in groups if g[0].name != 'br']
    befores.extend(before)
    afters.extend(after)
print(befores)
print(afters)
which prints:
['\n before break 1\n ', <span>before break 1.1</span>, '\n', '\n before break 2\n ']
['\n after break 1\n', '\n after break 2\n ', <span>after break 2.1</span>, '\n']
This should be enough to demonstrate how you can partition children under an element.
The only thing left to do is to loop over befores and afters and clean up each item.
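One way that clean-up step could look is sketched below. The helper name clean is hypothetical, and the befores list is stand-in data shaped like the output printed above rather than the result of running the loop.

```python
from bs4 import BeautifulSoup, Tag


def clean(nodes):
    """Return stripped, non-empty text for a mixed list of strings and tags."""
    cleaned = []
    for node in nodes:
        # Tags expose their nested text via get_text(); strings are used directly
        text = node.get_text() if isinstance(node, Tag) else str(node)
        text = text.strip()
        if text:
            # Whitespace-only nodes such as '\n' are dropped
            cleaned.append(text)
    return cleaned


# Stand-in data shaped like the `befores` list printed above
fragment = BeautifulSoup("<span>before break 1.1</span>", "html.parser")
befores = ["\n before break 1\n ", fragment.span, "\n", "\n before break 2\n "]

print(clean(befores))
# ['before break 1', 'before break 1.1', 'before break 2']
```

Because befores and afters are flat lists, this keeps the order of the items but not which outer span each one came from. If that matters, append each group as its own list instead of using extend.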
Upvotes: 1