Ryan Lewis

Reputation: 49

Beautiful Soup: findall and quoted classes

I am a new Python user banging my head against a wall on a Beautiful Soup issue. My target page contains the snippets below:

<div class=rbHeader>
<span role="heading" aria-level="3" class="ws_bold">
Experience Level</span>
</div>

<div class="  row  result" id="p_bc0437dce636c6f4" data-jk="bc0437dce636c6f4" itemscope itemtype="http://schema.org/JobPosting" data-tn-component="organicJob">

...

</div>

I have parsed the page as follows:

   target = Soup(urllib.urlopen(url), "lxml") 

If I run

targetElements = target.findAll('div', attrs={'class':'rbheader'})
print targetElements

I get

 [<div class="rbHeader">\n<span aria-level="3" class="ws_bold" role="heading">\nExperience Level</span>\n</div>]

but if I run

targetElements = target.findAll('div', attrs={'class':'  row  result'})
print targetElements

I get

[]

This happens with every class I try to select when the class value is in quotes. I can only seem to find classes whose values are not quoted.

Any help would be greatly appreciated.

Best, Ryan

Upvotes: 0

Views: 478

Answers (2)

Tiny.D

Reputation: 6556

Here is an example based on your div:

import bs4

div_test='<div class=rbHeader><span role="heading" aria-level="3" class="ws_bold">Experience Level</span></div><div class="  row  result" id="p_bc0437dce636c6f4" data-jk="bc0437dce636c6f4" itemscope itemtype="http://schema.org/JobPosting" data-tn-component="organicJob"></div>'
target = bs4.BeautifulSoup(div_test,'html.parser')

1. Class names are case sensitive, so your code

targetElements = target.findAll('div', attrs={'class':'rbheader'})
print targetElements

returns nothing ([]), whereas

targetElements = target.findAll('div', attrs={'class':'rbHeader'})
print targetElements

gives you:

[<div class="rbHeader"><span aria-level="3" class="ws_bold" role="heading">Experience Level</span></div>]

2. For the code:

targetElements = target.findAll('div', attrs={'class':'  row  result'})
print targetElements

I get the matching div rather than an empty list:

[<div class=" row result" data-jk="bc0437dce636c6f4" data-tn-component="organicJob" id="p_bc0437dce636c6f4" itemscope="" itemtype="http://schema.org/JobPosting"></div>]
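As a minimal sketch of point 1, written for Python 3 (the trimmed-down markup and the variable names wrong_case and right_case are only for illustration), the lookup below fails with the lowercase spelling and succeeds with the spelling used in the page:

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the first snippet from the question.
html = '<div class=rbHeader><span class="ws_bold">Experience Level</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Attribute values keep their original case, so the class must match exactly.
wrong_case = soup.find_all("div", attrs={"class": "rbheader"})
right_case = soup.find_all("div", attrs={"class": "rbHeader"})

print(len(wrong_case))  # 0
print(len(right_case))  # 1
```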

Upvotes: 0

JacobIRR

Reputation: 8946

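Beautiful Soup treats class as a multi-valued attribute: the value is split on whitespace, so the extra spaces in "  row  result" are not part of any class name. The div simply has the two classes row and result.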

You can just get one class:

targetElements = target.findAll('div', attrs={'class':'row'})

...or:

targetElements = target.findAll('div', attrs={'class':'result'})

If you are worried that either of these may return too many results, you can require both classes with a CSS selector:

soup.select('div.row.result')

...where soup is your BeautifulSoup instance (target in your code).
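
To make the difference concrete, here is a small Python 3 sketch. The two extra divs that carry only one of the classes are made up for the example, and the names by_row and both are just illustrative:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="  row  result" id="p_bc0437dce636c6f4"></div>'
    '<div class="row"></div>'      # hypothetical: has only "row"
    '<div class="result"></div>'   # hypothetical: has only "result"
)
soup = BeautifulSoup(html, "html.parser")

# Searching for a single class matches any div that carries it.
by_row = soup.find_all("div", class_="row")

# The CSS selector requires both classes on the same element.
both = soup.select("div.row.result")

print(len(by_row))                  # 2
print([d.get("id") for d in both])  # ['p_bc0437dce636c6f4']
```

Searching on one class is enough when it is unique on the page; the selector form is the safer choice when row or result also appear on unrelated elements.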

Upvotes: 1
