Reputation: 21341
I am scraping this URL.
I need to scrape the main content of the page, such as the Room Features and Internet Access sections.
Here is my code:
for h3s in Column:  # Suppose this is div.RightColumn
    for index, test in enumerate(h3s.select("h3")):
        print("Feature title: " + str(test.text))
        for v in h3s.select("ul")[index]:
            print(v.string.strip())
This code scrapes all the <li> items, but when it reaches Internet Access I get:
AttributeError: 'NoneType' object has no attribute 'strip'
This is because the <li> data under the Internet Access heading is contained inside double quotes, like "Wired High Speed Internet Access..."
I have tried replacing print(v.string.strip()) with print(v), which results in <li>Wired High...</li>
I have also tried print(v.text), but that does not work either.
The relevant section looks like:
<h3>Internet Access</h3>
<ul>
    <li>Wired High Speed Internet Access in All Guest Rooms
        <span class="fee">
            25 USD per day
        </span>
    </li>
</ul>
Upvotes: 0
Views: 800
Reputation: 1124518
BeautifulSoup elements only have a .string value if that string is the only child in the element. Your <li> tag contains a <span> element as well as text, so its .string is None and calling .strip() on it raises the AttributeError.
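A small sketch of that rule, using made-up markup rather than the page from the question. It assumes the built-in "html.parser" backend; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen only to illustrate the .string rule.
html = "<ul><li>Sea View Room</li><li>Wired Internet <span>25 USD</span></li></ul>"
soup = BeautifulSoup(html, "html.parser")
plain, mixed = soup.find_all("li")

print(plain.string)       # the <li> has a single text child
print(mixed.string)       # None: the <li> holds text plus a <span>
print(mixed.span.string)  # the <span> itself has a single text child
```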
Use the .text attribute instead to extract all strings as one:
print(v.text.strip())
or use the element.get_text() method:
print(v.get_text().strip())
which also takes a handy strip flag to remove extra whitespace:
print(v.get_text(' ', strip=True))
The first argument is the separator used to join the various strings together; I used a space here.
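To make the separator and strip arguments concrete, here is a small sketch on markup modelled on the <li> from the question (the exact whitespace is an assumption). It also uses .stripped_strings, which yields the same pieces individually, in case the fee should be kept apart from the description.

```python
from bs4 import BeautifulSoup

# Markup modelled on the question's <li>; the whitespace is assumed.
html = """<li>Wired High Speed Internet Access in All Guest Rooms
    <span class="fee">
        25 USD per day
    </span>
</li>"""
li = BeautifulSoup(html, "html.parser").li

spaced = li.get_text(" ", strip=True)   # pieces joined with a space
piped = li.get_text(" | ", strip=True)  # same pieces, different separator
parts = list(li.stripped_strings)       # the pieces, not joined

print(spaced)
print(piped)
print(parts)
```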
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <h3>Internet Access</h3>
... <ul>
... <li>Wired High Speed Internet Access in All Guest Rooms
... <span class="fee">
... 25 USD per day
... </span>
... </li>
... </ul>
... '''
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> soup.li
<li>Wired High Speed Internet Access in All Guest Rooms
<span class="fee">
25 USD per day
</span>
</li>
>>> soup.li.string
>>> soup.li.text
u'Wired High Speed Internet Access in All Guest Rooms\n \n 25 USD per day\n \n'
>>> soup.li.get_text(' ', strip=True)
u'Wired High Speed Internet Access in All Guest Rooms 25 USD per day'
Do make sure you call it on the element:
for index, test in enumerate(h3s.select("h3")):
    print("Feature title: ", test.text)
    ul = h3s.select("ul")[index]
    print(ul.get_text(' ', strip=True))
You could use the find_next_sibling() method here instead of indexing into the result of .select():
for header in h3s.select("h3"):
    print("Feature title: ", header.text)
    ul = header.find_next_sibling("ul")
    print(ul.get_text(' ', strip=True))
Demo:
>>> for header in h3s.select("h3"):
...     print("Feature title: ", header.text)
...     ul = header.find_next_sibling("ul")
...     print(ul.get_text(' ', strip=True))
...
Feature title: Room Features
Non-Smoking Room Connecting Rooms Available Private Terrace Sea View Room Suites Available Private Balcony Bay View Room Honeymoon Suite Starwood Preferred Guest Room Room with Sitting Area
Feature title: Internet Access
Wired High Speed Internet Access in All Guest Rooms 25 USD per day
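If each <li> should be kept as a separate item, as in the original loop, the same idea can be sketched as a self-contained script. The HTML below is a hypothetical stand-in for the page (only the Internet Access block mirrors the markup quoted in the question), and extract_features is an illustrative helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; only the Internet Access block
# mirrors the markup quoted in the question.
html = """
<div class="RightColumn">
  <h3>Room Features</h3>
  <ul>
    <li>Non-Smoking Room</li>
    <li>Sea View Room</li>
  </ul>
  <h3>Internet Access</h3>
  <ul>
    <li>Wired High Speed Internet Access in All Guest Rooms
      <span class="fee">
        25 USD per day
      </span>
    </li>
  </ul>
</div>
"""

def extract_features(soup):
    """Map each <h3> title to the list of texts of its following <ul>'s <li> items."""
    features = {}
    for header in soup.select("div.RightColumn h3"):
        ul = header.find_next_sibling("ul")
        if ul is None:  # no <ul> follows this heading: skip it
            continue
        features[header.get_text(strip=True)] = [
            li.get_text(" ", strip=True) for li in ul.find_all("li")
        ]
    return features

features = extract_features(BeautifulSoup(html, "html.parser"))
for title, items in features.items():
    print("Feature title:", title)
    for item in items:
        print(" -", item)
```

Note that find_next_sibling("ul") returns the next <ul> sibling even if another heading sits in between, so this sketch assumes every <h3> is followed by its own list.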
Upvotes: 1