Reputation: 49
I am a new Python user banging my head against a wall on a BeautifulSoup issue. My target page contains the snippets below:
<div class=rbHeader>
<span role="heading" aria-level="3" class="ws_bold">
Experience Level</span>
</div>
<div class=" row result" id="p_bc0437dce636c6f4" data-jk="bc0437dce636c6f4" itemscope itemtype="http://schema.org/JobPosting" data-tn-component="organicJob">
...
</div>
I have parsed the page as follows:
target = Soup(urllib.urlopen(url), "lxml")
If I run
targetElements = target.findAll('div', attrs={'class':'rbheader'})
print targetElements
I get
[<div class="rbHeader">\n<span aria-level="3" class="ws_bold" role="heading">\nExperience Level</span>\n</div>]
but if I run
targetElements = target.findAll('div', attrs={'class':' row result'})
print targetElements
I get
[]
This happens no matter which class I try to select, as long as the class is in quotes in the page source. I can only seem to find classes that are not quoted.
Any help would be greatly appreciated.
Best, Ryan
Upvotes: 0
Views: 478
Reputation: 6556
Here is an example based on your div:
div_test='<div class=rbHeader><span role="heading" aria-level="3" class="ws_bold">Experience Level</span></div><div class=" row result" id="p_bc0437dce636c6f4" data-jk="bc0437dce636c6f4" itemscope itemtype="http://schema.org/JobPosting" data-tn-component="organicJob"></div>'
target = bs4.BeautifulSoup(div_test,'html.parser')
1. Class names are case-sensitive, so your code
targetElements = target.findAll('div', attrs={'class':'rbheader'})
print targetElements
returns an empty list, [], whereas
targetElements = target.findAll('div', attrs={'class':'rbHeader'})
print targetElements
gives you:
[<div class="rbHeader"><span aria-level="3" class="ws_bold" role="heading">Experience Level</span></div>]
2. For the code:
targetElements = target.findAll('div', attrs={'class':' row result'})
print targetElements
When I run it against this string, I get the matching element rather than an empty list:
[<div class=" row result" data-jk="bc0437dce636c6f4" data-tn-component="organicJob" id="p_bc0437dce636c6f4" itemscope="" itemtype="http://schema.org/JobPosting"></div>]
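If you would rather not depend on exact capitalization or on how the whole class string is compared, here is a minimal sketch of two alternatives. It assumes bs4 with the built-in html.parser and uses a shortened copy of the markup above; it is written with print() so it runs on Python 3. has_both_classes is a hypothetical helper name, not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumption: most attributes dropped).
div_test = (
    '<div class=rbHeader><span class="ws_bold">Experience Level</span></div>'
    '<div class=" row result" id="p_bc0437dce636c6f4"></div>'
)
target = BeautifulSoup(div_test, "html.parser")

# Plain strings are compared case-sensitively, so the lowercase name misses.
lower = target.find_all("div", attrs={"class": "rbheader"})

# A compiled regex is tried against each class name; re.I ignores case.
any_case = target.find_all("div", attrs={"class": re.compile(r"^rbheader$", re.I)})


def has_both_classes(tag):
    """Hypothetical helper: true for a div carrying both 'row' and 'result'."""
    classes = set(tag.get("class") or [])
    return tag.name == "div" and {"row", "result"} <= classes


# Passing a function to find_all tests every tag with it.
both = target.find_all(has_both_classes)

print(len(lower), len(any_case), len(both))
```

The function version checks the parsed list of class names, so it does not care about leading spaces or the order of the classes in the source.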
Upvotes: 0
Reputation: 8946
BeautifulSoup splits the class attribute on whitespace, so spaces are never part of a class name.
You can just get one class:
targetElements = target.findAll('div', attrs={'class':'row'})
...or:
targetElements = target.findAll('div', attrs={'class':'result'})
If you are worried that either of these returns too many results, you can require both classes with a CSS selector:
soup.select('div.row.result')
...where soup is your BeautifulSoup instance (target in your code).
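To see the difference between the single-class lookups and the selector, here is a small sketch. The three divs and their ids are made up for illustration, and it assumes bs4 with the built-in html.parser (select needs a reasonably recent bs4 install):

```python
from bs4 import BeautifulSoup

# Made-up markup: ids a, b, c are only there to tell the divs apart.
html = (
    '<div class=" row result" id="a"></div>'
    '<div class="row" id="b"></div>'
    '<div class="result row" id="c"></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Matching one class returns every div that carries it.
by_row = [d["id"] for d in soup.find_all("div", attrs={"class": "row"})]

# The selector requires both classes; order and extra spaces do not matter.
by_both = [d["id"] for d in soup.select("div.row.result")]

print(by_row)
print(by_both)
```

The first list should contain all three ids and the second only the divs that have both classes.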
Upvotes: 1