Lancelot

Reputation: 3

Error accessing a tag with a hyphen (-) separated name in an HTML file using BeautifulSoup

I am trying to scrape the data for popular English movies on Hotstar.

I downloaded the html source code and I am doing this:

from bs4 import BeautifulSoup as soup
page_soup = soup(open('hotstar.html'),'html.parser')
containers = page_soup.findAll("div",{"class":"col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope"}) 
container = containers[0]
# To get video link
container.div.hs-cards-directive.article.a

I am getting an error at this point:

NameError: name 'cards' is not defined

These are the first few lines of the html file:

<div bindonce="" class="col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope" ng-repeat="slides in gridcardData">
<hs-cards-directive cdata="slides" class="ng-isolate-scope" renderingdone="shownCard()">
    <article class="card show-card" ng-class="{'live-sport-card':isLiveSportCard, 'card-active':btnRemoveShow,'tounament-tray-card':record.isTournament}" ng-click="cardeventhandler({cardrecord:record})" ng-init="init()" pdata="record" removecard="removecard" watched="watched">
        <a href="http://www.hotstar.com/movies/step-up-revolution/1770016594" ng-href="/movies/step-up-revolution/1770016594" restrict-anchor="">

Please help me out! I am using Python 3.6.3 on Windows.

Upvotes: 0

Views: 404

Answers (1)

abarnert

Reputation: 365577

As (loosely) explained in the Going down section of the docs, the tag.descendant syntax is just a convenient shortcut for tag.find('descendant').

That shortcut can't be used in cases where you have tags whose names aren't valid Python identifiers.[1] (The same goes for tags whose names collide with methods of BS4 itself, like a <find> tag.)
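To illustrate the parenthetical about name collisions, here is a minimal sketch. The markup is made up for the example (a hypothetical <find> tag inside a hypothetical <root> tag), and it assumes the built-in html.parser backend:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a custom tag that happens to be called <find>.
soup = BeautifulSoup("<root><find></find></root>", "html.parser")
root = soup.root

# Attribute access resolves to the Tag.find method, not to the <find> tag.
shortcut_result = root.find

# Calling find explicitly reaches the tag.
explicit_result = root.find("find")

print(callable(shortcut_result))
print(explicit_result.name)
```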


Python identifiers can only have letters, digits, and underscores, not hyphens. So, when you write this:

container.div.hs-cards-directive.article.a

… Python parses it like this mathematical expression:

container.div.hs - cards - directive.article.a

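BeautifulSoup's div node has no descendant named hs, but that's fine; the lookup just returns None. The problem comes next: to do the subtraction, Python has to look up a variable named cards, no such variable exists, and you get a NameError before anything is actually subtracted.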
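If you want to check that parse for yourself, one way is to hand the expression to the standard library's ast module, which parses it without evaluating anything. This is only a sketch for inspection, not something you need in your scraper:

```python
import ast

# Parse the expression from the question without running it.
source = "container.div.hs-cards-directive.article.a"
expr = ast.parse(source, mode="eval").body

# The outermost node is a subtraction, and so is its left operand,
# because "-" is left-associative: (container.div.hs - cards) - directive.article.a
inner = expr.left

print(type(expr).__name__, type(expr.op).__name__)
print(inner.left.attr)   # attribute looked up on container.div
print(inner.right.id)    # bare name that Python will try to resolve
print(expr.right.attr)   # last attribute of the right-hand operand
```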


Anyway, the straightforward fix in this case is to not use the shortcut and call find explicitly:

container.div.find('hs-cards-directive').article.a
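Here is a self-contained sketch of the explicit find call. It uses a trimmed-down copy of the markup from the question rather than the real hotstar.html, and since that excerpt shows no inner div, find is called on the container directly; in your actual file you would keep the container.div part as above.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the markup quoted in the question.
html = """
<div bindonce="" class="col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope">
<hs-cards-directive cdata="slides" class="ng-isolate-scope">
    <article class="card show-card">
        <a href="http://www.hotstar.com/movies/step-up-revolution/1770016594"></a>
    </article>
</hs-cards-directive>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")
container = page_soup.find("div", class_="ng-scope")

# The hyphenated tag name is passed as a string, so Python never
# tries to treat the hyphens as minus signs.
link = container.find("hs-cards-directive").article.a
print(link["href"])
```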

Or, if it makes sense for your use case, you can just skip down to article, because the shortcut finds any descendants, not just direct children:

container.div.article.a

But I don't think that's appropriate in your case; you want articles only under specific child nodes, not all possible articles, right?
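To make the descendant-versus-child difference concrete, here is a small sketch with made-up markup (the section tag and the id values are hypothetical). It relies on find's recursive=False argument, which restricts the search to direct children:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one article nested deeper, one directly under the div.
html = (
    "<div>"
    "<section><article id='nested'></article></section>"
    "<article id='direct'></article>"
    "</div>"
)
div = BeautifulSoup(html, "html.parser").div

# The shortcut searches all descendants and returns the first match
# in document order, which here is the nested one.
first_descendant = div.article

# recursive=False limits the search to direct children of the div.
first_child = div.find("article", recursive=False)

print(first_descendant["id"], first_child["id"])
```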


1. Technically, it is actually possible to use the shortcut, it's just not a shortcut anymore. If you understand what getattr(container.div, 'hs-cards-directive').article.a means, then you can write that and it will work… but obviously find is going to be more readable and easier to understand.
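For the curious, a quick sketch of that footnote, again on simplified, made-up markup (the href value is a placeholder): getattr with the hyphenated name goes through the same lookup as the shortcut and ends up at the same tag that find returns.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup with the hyphenated custom tag.
html = (
    "<div><hs-cards-directive><article>"
    "<a href='/movies/example'></a>"
    "</article></hs-cards-directive></div>"
)
div = BeautifulSoup(html, "html.parser").div

# getattr accepts any string, so the hyphens are not a syntax problem here.
via_getattr = getattr(div, "hs-cards-directive")
via_find = div.find("hs-cards-directive")

print(via_getattr is via_find)
print(via_getattr.article.a["href"])
```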

Upvotes: 3
