I Like

Reputation: 1847

Pulling Text from 'NavigableString' and 'Tag' Objects in Beautiful Soup

I'm stuck parsing the part of the Rotten Tomatoes website that holds the critics score: the number sits in its own tag and the "%" follows as separate text. I followed some SO suggestions, such as using find_all('span',text="true"), but the Python 3.5.1 shell returned this error:

AttributeError: 'NavigableString' object has no attribute 'find_all'

I also tried finding the direct child of the Beautiful Soup object criticscore, but received the same error. Please tell me where I went wrong. Here's my Python code:

import urllib.request
from bs4 import BeautifulSoup

def get_rating(address):
    """pull ratings numbers from rotten tomatoes"""
    RTaddress = urllib.request.urlopen(address)
    tomatoe = BeautifulSoup(RTaddress, "lxml")
    for criticscore in tomatoe.find('span', class_=['meter-value superPageFontColor']):
        print(''.join(criticscore.find_all('span', recursive=False)))  # print the Tomatometer

Also, here's the HTML on Rotten Tomatoes I'm interested in scraping:

<div class="critic-score meter">
    <a href="#contentReviews" class="unstyled articleLink" id="tomato_meter_link">
        <span class="meter-tomato icon big medium-xs certified_fresh pull-left"></span>
        <span class="meter-value superPageFontColor"><span>96</span>%</span>
    </a>
</div>

Upvotes: 1

Views: 3159

Answers (1)

alecxe

Reputation: 474031

The problem line is this one:

for criticscore in tomatoe.find('span', class_=['meter-value superPageFontColor']):

Here, you locate a single element with find() and then iterate over it. Iterating over an element in BeautifulSoup yields its children, which can be text nodes (NavigableString objects) as well as other elements. A text node has no find_all() method, hence the AttributeError.
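To see this behaviour in isolation, here is a minimal sketch that parses only the inner part of the markup from the question. It assumes the built-in "html.parser" backend and looks the element up by the single class "meter-value"; both are choices made for this illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# Inner part of the markup from the question (assumed unchanged).
html = '<span class="meter-value superPageFontColor"><span>96</span>%</span>'
soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag, not a list of matches.
outer = soup.find("span", class_="meter-value")

# Iterating over a Tag walks its direct children.
children = list(outer)
kinds = [type(child).__name__ for child in children]

for child in children:
    print(type(child).__name__, repr(str(child)))
```

The loop visits the inner span (a Tag) first and then the "%" text node (a NavigableString), which is the object the original code called find_all() on.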

You probably meant to use find_all() instead of find():

for criticscore in tomatoe.find_all('span', class_=['meter-value superPageFontColor']):
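Note that the loop body from the question needs a change as well: str.join() accepts only strings, so passing it the Tag objects returned by find_all() raises a TypeError. A minimal sketch of the corrected loop, assuming the markup from the question and the "html.parser" backend (the names html, soup and printed are illustrative only):

```python
from bs4 import BeautifulSoup

# Markup from the question, reduced to the relevant element.
html = '<span class="meter-value superPageFontColor"><span>96</span>%</span>'
soup = BeautifulSoup(html, "html.parser")

printed = []
for criticscore in soup.find_all("span", class_="meter-value"):
    # Direct child <span> elements only, as in the original recursive=False call.
    inner = criticscore.find_all("span", recursive=False)
    # Convert each Tag to its text before joining.
    score = "".join(tag.get_text() for tag in inner)
    printed.append(score)
    print(score)
```

Here each criticscore is a Tag, so find_all() is available on it, and get_text() turns the inner span into the string "96".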

Or, you can use a single CSS selector instead:

for criticscore in tomatoe.select('span.meter-value > span'):
    print(criticscore.get_text())

Here, > is the child combinator: it matches only direct children, so it takes the place of recursive=False.
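Put together, the selector approach can be sketched as a small helper that works on an HTML string, so it can be tried without a network request. The function name extract_scores, the SAMPLE_HTML constant and the use of "html.parser" are assumptions made for this sketch; the live page markup may differ from the snippet in the question.

```python
from bs4 import BeautifulSoup

# Snippet copied from the question; the real page may have changed since.
SAMPLE_HTML = """
<div class="critic-score meter">
    <a href="#contentReviews" class="unstyled articleLink" id="tomato_meter_link">
        <span class="meter-tomato icon big medium-xs certified_fresh pull-left"></span>
        <span class="meter-value superPageFontColor"><span>96</span>%</span>
    </a>
</div>
"""


def extract_scores(html):
    """Return the text of every <span> that is a direct child of span.meter-value."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("span.meter-value > span")]


scores = extract_scores(SAMPLE_HTML)
print(scores)

# Taking the text of the outer span instead keeps the percent sign.
with_percent = BeautifulSoup(SAMPLE_HTML, "html.parser").select_one("span.meter-value").get_text(strip=True)
print(with_percent)
```

To use it against the live site, the string passed to extract_scores() would be the response body read from urllib.request.urlopen(address), as in the question.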

Upvotes: 2
