Unable to scrape the text from a certain LI element

Question

I am scraping this URL.

I have to scrape the main content of the page, such as Room Features and Internet Access.

Here is my code:

for h3s in Column:  # Suppose this is div.RightColumn
    for index,test in enumerate(h3s.select("h3")):
        print("Feature title: "+str(test.text))
        for v in h3s.select("ul")[index]:
            print(v.string.strip())

This code scrapes all the <li>'s, but when it comes to scraping Internet Access I get

AttributeError: 'NoneType' object has no attribute 'strip'

Because the <li>'s data under the Internet Access heading is contained inside double quotes, like "Wired High Speed Internet Access..."

I have tried replacing print(v.string.strip()) with print(v), which results in

Wired High...

I have also tried print(v.text), but that does not work either.

The relevant section looks like:

<h3>Internet Access</h3>
<ul>
    <li>Wired High Speed Internet Access in All Guest Rooms
     <span>
        25 USD per day
     </span>
   </li>
 </ul>

Martijn Pieters · Accepted Answer

BeautifulSoup elements only have a .string value if that string is the only child in the element. Your <li> tag contains a <span> element as well as text, so .string is None.
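To illustrate that rule, here is a minimal sketch. The markup is made up for illustration and is not taken from the page in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <li> holds only text,
# the second mixes text with a <span>.
html = "<ul><li>Sea View Room</li><li>Wired Internet <span>25 USD per day</span></li></ul>"
soup = BeautifulSoup(html, "html.parser")
plain, mixed = soup.find_all("li")

only_text = plain.string      # the single text child is returned
mixed_string = mixed.string   # None: this <li> has two children (text and <span>)
print(repr(only_text), repr(mixed_string))
```

Calling .strip() on the second value is what raises the AttributeError shown in the question.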

Use the .text attribute instead to extract all strings as one:

print(v.text.strip())

or use the element.get_text() method:

print(v.get_text().strip())

which also takes a handy strip flag to remove extra whitespace:

print(v.get_text(' ', strip=True))

The first argument is the separator used to join the various strings together; I used a space here.
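The separator matters most when the markup has no whitespace between neighbouring strings. A small sketch, again with hypothetical markup, comparing the default (empty) separator with a space:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between the text and the <span>.
soup = BeautifulSoup(
    "<li>Wired Internet<span> 25 USD per day </span></li>", "html.parser"
)

# With strip=True each string is stripped first, then the pieces are joined.
joined = soup.li.get_text(strip=True)       # joined with "" (the default)
spaced = soup.li.get_text(" ", strip=True)  # joined with a single space
print(repr(joined))
print(repr(spaced))
```

Without a separator the two pieces run together ("Wired Internet25 USD per day"); with a space they stay readable.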

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <h3>Internet Access</h3>
... <ul>
...     <li>Wired High Speed Internet Access in All Guest Rooms
...      <span>
...         25 USD per day
...      </span>
...    </li>
...  </ul>
... '''
>>> soup = BeautifulSoup(sample)
>>> soup.li
<li>Wired High Speed Internet Access in All Guest Rooms
     <span>
        25 USD per day
     </span>
   </li>
>>> soup.li.string
>>> soup.li.text
u'Wired High Speed Internet Access in All Guest Rooms\n     \n        25 USD per day\n     \n   '
>>> soup.li.get_text(' ', strip=True)
u'Wired High Speed Internet Access in All Guest Rooms 25 USD per day'

Do make sure you call it on the <ul> element:

for index, test in enumerate(h3s.select("h3")):
    print("Feature title: ", test.text)
    ul = h3s.select("ul")[index]
    print(ul.get_text(' ', strip=True))

You could use the find_next_sibling() function here instead of indexing into a .select():

for header in h3s.select("h3"):
    print("Feature title: ", header.text)
    ul = header.find_next_sibling("ul")
    print(ul.get_text(' ', strip=True))

Demo:

>>> for header in h3s.select("h3"):
...     print("Feature title: ", header.text)
...     ul = header.find_next_sibling("ul")
...     print(ul.get_text(' ', strip=True))
... 
Feature title: Room Features
Non-Smoking Room Connecting Rooms Available Private Terrace Sea View Room Suites Available Private Balcony Bay View Room Honeymoon Suite Starwood Preferred Guest Room Room with Sitting Area
Feature title: Internet Access
Wired High Speed Internet Access in All Guest Rooms 25 USD per day
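For a version that runs on its own, the sketch below applies the same find_next_sibling() approach to a hypothetical stand-in for div.RightColumn. The markup, the extra "Parking" heading, and the features dictionary are assumptions added for illustration. It also guards against a heading with no following list, since find_next_sibling() returns None in that case:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's div.RightColumn;
# the last <h3> deliberately has no list after it.
html = """
<div class="RightColumn">
  <h3>Room Features</h3>
  <ul><li>Sea View Room</li><li>Private Balcony</li></ul>
  <h3>Internet Access</h3>
  <ul><li>Wired High Speed Internet Access <span>25 USD per day</span></li></ul>
  <h3>Parking</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

features = {}
for header in soup.select("div.RightColumn h3"):
    ul = header.find_next_sibling("ul")
    # find_next_sibling() returns None when no later <ul> sibling exists.
    items = [] if ul is None else [
        li.get_text(" ", strip=True) for li in ul.find_all("li")
    ]
    features[header.get_text(strip=True)] = items

for title, items in features.items():
    print("Feature title:", title, items)
```

Collecting one string per <li> keeps the individual features separate, rather than flattening the whole list into a single line.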
