Reputation: 2198
I'm trying to parse some HTML with BeautifulSoup4 and Python 2.7.6, but .string is returning None. The HTML I'm trying to parse is:
<div class="booker-booking">
2 rooms
·
USD 0
<!-- Commission: USD -->
</div>
The Python snippet I have is:
data = soup.find('div', class_='booker-booking').string
I've also tried the following two:
data = soup.find('div', class_='booker-booking').text
data = soup.find('div', class_='booker-booking').contents[0]
Which both return:
u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n
I'm ultimately trying to get the first line into a variable just saying "2 Rooms", and the third line into another variable just saying "USD 0".
Upvotes: 3
Views: 10658
Reputation: 414079
.string returns None because the text node is not the only child (there is a comment).
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(html, 'html.parser')  # html holds the markup from the question
div = soup.find('div', 'booker-booking')
# remove comments
text = " ".join(div.find_all(text=lambda t: not isinstance(t, Comment)))
# -> u'\n 2\xa0rooms\n \xb7\n USD\xa00\n \n'
To remove Unicode whitespace:
text = " ".join(text.split())
# -> u'2 rooms \xb7 USD 0'
print text
# -> 2 rooms · USD 0
To get your final variables:
var1, var2 = [s.strip() for s in text.split(u"\xb7")]
# -> u'2 rooms', u'USD 0'
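To see the .string behaviour and the steps above in one place, here is a minimal sketch. It assumes bs4 4.4 or newer (for the string= keyword; older versions use text= as above) and the built-in 'html.parser'. The sample markup is reconstructed from the question's output, and parse_booking is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup modeled on the question (an assumption about the real page).
# \u00a0 is a non-breaking space, \u00b7 is the middle dot.
HTML = u"""<div class="booker-booking">
\t\t2\u00a0rooms
\t\t\u00b7
\t\tUSD\u00a00
\t\t<!-- Commission: USD -->
</div>"""


def parse_booking(html):
    """Hypothetical helper: return (rooms, price) from the booker-booking div."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="booker-booking")
    # Keep the text nodes and skip the comment node.
    parts = div.find_all(string=lambda t: not isinstance(t, Comment))
    # split() with no argument also splits on Unicode whitespace such as \u00a0.
    text = u" ".join(u" ".join(parts).split())
    rooms, price = [s.strip() for s in text.split(u"\u00b7")]
    return rooms, price


div = BeautifulSoup(HTML, "html.parser").find("div", class_="booker-booking")
# The div has three children (text, comment, text), so .string is None.
string_value = div.string
rooms, price = parse_booking(HTML)
```

Here string_value is None, while rooms and price hold u'2 rooms' and u'USD 0'.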
Upvotes: 5
Reputation: 463
After you have done data = soup.find('div', class_='booker-booking').text you've extracted the data you need from the HTML. Now you just need to format it to get "2 Rooms" and "USD 0". The first step is probably splitting the data by line:
lines = data.split('\n')
Which will give [u'', u'\t\t2\xa0rooms ', u'\t\t\xb7', u'\t\tUSD\xa00', u'\t\t', u'']
Now you need to get rid of the whitespace, unescape the HTML characters, and remove the lines that don't have data:
import HTMLParser
h = HTMLParser.HTMLParser()
# len(line) > 3 skips the empty lines and the lone separator line (u'\t\t\xb7')
formatted_lines = [h.unescape(line).strip() for line in lines if len(line) > 3]
You will be left with the data you want:
print formatted_lines[0]
#2 rooms
print formatted_lines[1]
#USD 0
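Note that the len(line) > 3 filter depends on the exact indentation of the page. As a hedged alternative, the sketch below strips each line first and then drops blank lines and the separator by content. It assumes data looks like the .text output shown in the question; SEPARATOR, stripped and fields are illustrative names.

```python
# Text as returned by .text in the question (u'' literals work on Python 2.7 and 3).
data = u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n'

SEPARATOR = u'\xb7'  # the middle dot between the two fields

# Strip every line, then keep only lines that are neither blank nor the separator.
stripped = [line.strip() for line in data.split(u'\n')]
fields = [line for line in stripped if line and line != SEPARATOR]

# Replace the inner non-breaking spaces with plain spaces.
rooms, price = [field.replace(u'\xa0', u' ') for field in fields]
```

This leaves rooms as u'2 rooms' and price as u'USD 0', with ordinary spaces instead of \xa0.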
Upvotes: 0