Reputation: 23
I'm trying to extract some strings from an HTML file with BeautifulSoup, but every time I work with it I only get partial results.
I want the strings inside every li element. So far I've been able to get everything in the ul like this:
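#!/usr/bin/python
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
with open("page.html") as page:
    soup = BeautifulSoup(page, "html.parser")
source = soup.select(".sidebar li")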
And what I get is this:
[<li class="first">
Def Leppard - Make Love Like A Man<span>Live</span> </li>, <li>
Inxs - Never Tear Us Apart </li>, <li>
Gary Moore - Over The Hills And Far Away </li>, <li>
Linkin Park - Numb </li>, <li>
Vita De Vie - Basul Si Cu Toba Mare </li>, <li>
Nazareth - Love Hurts </li>, <li>
U2 - I Still Haven't Found What I'm L </li>, <li>
Blink 182 - All The Small Things </li>, <li>
Scorpions - Wind Of Change </li>, <li>
Iggy Pop - The Passenger </li>]
I want to get only the strings from this.
Upvotes: 2
Views: 1566
Reputation: 473903
Iterate over the results and read each element's text attribute:
for element in soup.select(".sidebar li"):
    print(element.text)
Example:
from bs4 import BeautifulSoup
data = """
<body>
<ul>
<li class="first">Def Leppard - Make Love Like A Man<span>Live</span> </li>
<li>Inxs - Never Tear Us Apart </li>
</ul>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
for element in soup.select('li'):
    print(element.text)
prints:
Def Leppard - Make Love Like A ManLive
Inxs - Never Tear Us Apart
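Note that .text glues the nested span onto the title ("...ManLive"). If that is not what you want, here is a minimal sketch using get_text() with a separator. It assumes the built-in html.parser, and the class="sidebar" on the ul is my guess at the real markup, taken from the selector in the question:

```python
from bs4 import BeautifulSoup

# Sample markup; class="sidebar" on the <ul> is an assumption based on the question's selector
data = """
<ul class="sidebar">
<li class="first">Def Leppard - Make Love Like A Man<span>Live</span> </li>
<li>Inxs - Never Tear Us Apart </li>
</ul>
"""

soup = BeautifulSoup(data, "html.parser")

# strip=True trims each text fragment and drops empty ones;
# the first argument is the separator placed between fragments
titles = [li.get_text(" ", strip=True) for li in soup.select(".sidebar li")]

for title in titles:
    print(title)
```

Without the separator argument, the fragments would be joined with nothing in between, the same way .text joins them.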
Upvotes: 1
Reputation: 50580
This example from the documentation gives a very nice one-liner (find_all(string=True) is the current spelling of the older findAll(text=True)):
''.join(BeautifulSoup(source, "html.parser").find_all(string=True))
Upvotes: 0
Reputation: 6332
Use Beautiful Soup's .strings or .stripped_strings generator:
for string in soup.stripped_strings:
    print(repr(string))
From the docs:
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
or
These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:
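Calling .stripped_strings on the whole soup walks every string in the document. To restrict it to the li elements from the question, something like the following sketch should work. The wrapping div with class="sidebar" is an assumption about the page, and html.parser is just the built-in parser:

```python
from bs4 import BeautifulSoup

# Sample markup shaped like the output in the question;
# the <div class="sidebar"> wrapper is assumed, not taken from the real page
html = """
<div class="sidebar"><ul>
<li class="first">
Def Leppard - Make Love Like A Man<span>Live</span> </li>
<li>
Inxs - Never Tear Us Apart </li>
</ul></div>
"""

soup = BeautifulSoup(html, "html.parser")

# One list of whitespace-stripped strings per <li>; whitespace-only strings are skipped
rows = [list(li.stripped_strings) for li in soup.select(".sidebar li")]

for parts in rows:
    print(parts)
```

This keeps the nested span text ("Live") as a separate item, so you can decide per element whether to keep or drop it.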
Upvotes: 2