user601836

Reputation: 3235

Suggestions on get_text() in BeautifulSoup

I am using BeautifulSoup to parse some content from an HTML page.

I can extract the content I want from the HTML (i.e. the text contained in a span with the class myclass).

result = mycontent.find(attrs={'class':'myclass'})

I obtain this result:

<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

If I try to extract the text using:

result.get_text()

I obtain:

Lorem ipsumdolor sit amet,consectetur...

As you can see, when the <br> tags are removed there is no spacing left between the pieces of text, and adjacent words are concatenated.

How can I solve this issue?

Upvotes: 16

Views: 53178

Answers (3)

explorer

Reputation: 313

result.get_text(separator=" ") should work.
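For illustration, here is a minimal sketch of that call, assuming Python 3 with bs4 installed. The HTML string is copied from the question; the variable names plain and spaced and the choice of the built-in "html.parser" are assumptions made for this example, not something taken from the original post.

```python
from bs4 import BeautifulSoup

# Markup taken from the question; in real use it would come from the fetched page.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'

soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# Default behaviour: the text fragments are joined with an empty string.
plain = result.get_text()

# With a separator, the fragments on either side of each <br/> are joined with a space.
spaced = result.get_text(separator=" ")

print(plain)
print(spaced)
```

If the fragments themselves may carry stray whitespace, get_text(separator=" ", strip=True) also trims each fragment before joining.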

Upvotes: 7

Floris

Reputation: 46415

Use contents, then replace the <br> tags.

Here is a full (working, tested) example:

from bs4 import BeautifulSoup
import urllib2

url="http://www.floris.us/SO/bstest.html"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

result = soup.find(attrs={'class':'myclass'})
print "The result of soup.find:"
print result

print "\nresult.contents:"
print result.contents
print "\nresult.get_text():"
print result.get_text()
for r in result:
  if (r.string is None):
    r.string = ' '

print "\nAfter replacing all the 'None' with ' ':"
print result.get_text()

Result:

The result of soup.find:
<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

result.contents:
[u'Lorem ipsum', <br/>, u'dolor sit amet,', <br/>, u'consectetur...']

result.get_text():
Lorem ipsumdolor sit amet,consectetur...

After replacing all the 'None' with ' ':
Lorem ipsum dolor sit amet, consectetur...

This is more elaborate than Sean's very compact solution, but since I had said I would create and test a solution along the lines I had indicated when I could, I decided to follow through on my promise. You can see a little better what is going on here: each <br/> is its own element in the result.contents list, but it contains no text, so get_text() contributes nothing for it and the neighbouring strings run together.
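The same idea can also be sketched for Python 3 without fetching a page. This is a variant rather than the code tested above: it swaps each <br/> for a space using replace_with, and it assumes the markup from the question plus the built-in "html.parser". The variable name text is made up for this example.

```python
from bs4 import BeautifulSoup

# Markup taken from the question, used inline instead of downloading a page.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'

soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# Swap every <br/> element for a single space. This modifies the parsed tree in place.
for br in result.find_all("br"):
    br.replace_with(" ")

text = result.get_text()
print(text)
```

Because the tree is modified, this suits cases where the original <br/> elements are no longer needed afterwards. Replacing with "\n" instead of " " would keep the line breaks.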

Upvotes: 19

Sean Vieira

Reputation: 160015

If you are using bs4 you can use strings:

" ".join(result.strings)
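A small self-contained sketch of this approach, assuming Python 3 with bs4 and the markup from the question. The names fragments and joined are chosen for this example only.

```python
from bs4 import BeautifulSoup

# Markup taken from the question.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'

soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# .strings is a generator over every text fragment inside the tag;
# the <br/> elements contain no text, so they yield nothing.
fragments = list(result.strings)

# Joining with a space puts the gap back where each <br/> used to be.
joined = " ".join(fragments)

print(fragments)
print(joined)
```

When the fragments may include leading or trailing whitespace, result.stripped_strings yields the same fragments with whitespace trimmed and whitespace-only fragments skipped.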

Upvotes: 25
