Reputation: 3235
I am using BeautifulSoup to parse some content from an HTML page.
I can extract the content I want from the HTML (i.e. the text contained in a span with the class myclass):
result = mycontent.find(attrs={'class':'myclass'})
I obtain this result:
<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>
If I try to extract the text using:
result.get_text()
I obtain:
Lorem ipsumdolor sit amet,consectetur...
As you can see, when the <br> tag is removed there is no spacing left between the contents, and two words end up concatenated.
How can I solve this issue?
Upvotes: 16
Views: 53178
Reputation: 46415
Use 'contents', then replace the <br> elements?
Here is a full (working, tested) example:
from bs4 import BeautifulSoup
import urllib2
url="http://www.floris.us/SO/bstest.html"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
result = soup.find(attrs={'class':'myclass'})
print "The result of soup.find:"
print result
print "\nresult.contents:"
print result.contents
print "\nresult.get_text():"
print result.get_text()
# A <br/> tag has no string of its own, so give it a single space
for r in result:
    if r.string is None:
        r.string = ' '
print "\nAfter replacing all the 'None' with ' ':"
print result.get_text()
Result:
The result of soup.find:
<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>
result.contents:
[u'Lorem ipsum', <br/>, u'dolor sit amet,', <br/>, u'consectetur...']
result.get_text():
Lorem ipsumdolor sit amet,consectetur...
After replacing all the 'None' with ' ':
Lorem ipsum dolor sit amet, consectetur...
This is more elaborate than Sean's very compact solution, but since I had said I would create and test a solution along the lines I had indicated when I could, I decided to follow through on my promise. You can see a little better what is going on here: the <br/> is its own element in the result.contents list, but when it is converted to a string there's "nothing left".
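For anyone on Python 3, here is a rough sketch of the same idea. It is not the tested script above: it assumes the HTML from the question is passed in as a string instead of being fetched from a URL, it names the built-in "html.parser" explicitly, and it swaps each <br/> for a space with replace_with() rather than assigning to r.string. The function name text_with_spaces is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumption: inline string, no URL fetch)
HTML = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'


def text_with_spaces(html):
    """Hypothetical helper: return the span's text with each <br/> turned into a space."""
    soup = BeautifulSoup(html, "html.parser")
    result = soup.find(attrs={"class": "myclass"})
    # find_all returns a list, so replacing nodes while looping over it is safe
    for br in result.find_all("br"):
        br.replace_with(" ")
    return result.get_text()


if __name__ == "__main__":
    print(text_with_spaces(HTML))
```

Note that this modifies the parsed tree, so if you still need the original <br/> elements afterwards you would have to parse the HTML again.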
Upvotes: 19
Reputation: 160015
If you are using bs4 you can use strings:
" ".join(result.strings)
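To try the one-liner without a live page, something like the following should work. It is only a sketch: the markup is the sample from the question pasted in as a string, and "html.parser" is an assumed parser choice. It also shows get_text() with its separator argument, which as far as I know gives the same result here without touching the tree.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumption: inline string, no URL fetch)
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# .strings yields each text node in turn; the <br/> tags contribute nothing
joined = " ".join(result.strings)

# get_text() accepts a separator that is placed between the text nodes
with_separator = result.get_text(separator=" ")

print(joined)
print(with_separator)
```

Both variants leave the parsed tree unchanged, which may matter if you still need the <br/> elements later.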
Upvotes: 25