Reputation: 3067
I'm new to Python and am trying to parse some simple HTML, but one thing is stopping me. For example, given this HTML:
<div class = "quote">
<div class = "whatever">
some unnecessary text here
</div>
<div class = "text">
Here's the desired text!
</div>
</div>
I need to extract the text from the second div (the one with class "text"). This is how I get it:
print repr(link.find('div').findNextSibling())
However, this returns the whole div, tags included: <div class="text">Here's the desired text!</div>
I don't know how to get only the text.
.text results in \u043a\u0430\u043a \u0440\u0430\u0437\u0440\u0430\u0431
.strings returns "None"
.string returns both "None" and \u042f\u0445\u0438\u043a\u043e - \u0435\u0441\u043b\u0438\
Maybe there's something wrong with repr?
P.S. I need to keep the tags inside the div too.
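A note on the \u escapes: they are most likely just how repr displays non-ASCII characters, not a sign that the extracted text is broken. Here is a minimal Python 3 sketch of that effect, where ascii() stands in for what repr() does to a unicode string in Python 2. The sample string is only the first word of the output quoted above, and the variable names are made up for illustration.

```python
# First word from the escaped output in the question (three Cyrillic letters).
text = '\u043a\u0430\u043a'

# ascii() escapes every non-ASCII character, much like Python 2's repr()
# does for unicode strings.
escaped = ascii(text)

print(escaped)    # the escaped form, wrapped in quotes
print(len(text))  # the string itself still holds 3 real characters
```

Printing the string directly (without repr) on a Unicode-capable terminal should show the letters themselves.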
Upvotes: 0
Views: 37
Reputation: 36282
Why don't you simply search for the <div> element based on its class attribute? Something like the following seems to work for me:
from bs4 import BeautifulSoup
html = '''<div class = "quote">
<div class = "whatever">
some unnecessary text here
</div>
<div class = "text">
Here's the desired text!
</div>
</div>'''
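# Name the built-in parser explicitly so the result does not depend on what is installed
link = BeautifulSoup(html, 'html.parser')
# Find the div by its class and print its text with surrounding whitespace removed
print(link.find('div', class_="text").text.strip())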
It yields:
Here's the desired text!
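As for the P.S. about keeping the tags inside the div: one possible approach is decode_contents(), which returns the inner markup without the enclosing <div>. The sketch below assumes bs4 is installed. The nested <b> tag is not in the original HTML; it is added here only so there is an inner tag to preserve.

```python
from bs4 import BeautifulSoup

# Same structure as in the question, plus a hypothetical <b> tag inside the target div.
html = '''<div class="quote">
<div class="whatever">some unnecessary text here</div>
<div class="text">Here's the <b>desired</b> text!</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
target = soup.find('div', class_='text')

# Text only: nested tags are dropped.
text_only = target.get_text().strip()

# Inner markup: nested tags are kept, the outer <div> is not.
inner_html = target.decode_contents().strip()

print(text_only)
print(inner_html)
```

Use get_text() when you want plain text and decode_contents() when the inner tags matter. Printing the values directly, rather than through repr, also avoids the escaped output for non-ASCII text.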
Upvotes: 1