kamen1111

Reputation: 175

Python & Beautifulsoup 4 - Unable to filter classes?

I'm trying to scrape shoe sizes from this URL: http://www.jimmyjazz.com/mens/footwear/jordan-retro-13--atmosphere-grey-/414571-016?color=Grey

What I'm trying to do is get only the sizes that are available, i.e. only those that aren't greyed out.

The sizes are all wrapped in <a> elements. The available sizes have the class "box", and the unavailable ones have the classes "box piunavailable".

I have tried using a lambda function, an if condition and CSS selectors, but none of them seem to work. My guess is that it's because of the way my code is structured.

The way it's structured is as follows:

The if attempt

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll('a', attrs={'class': 'box'}) if 'piunavailable' not in e.attrs['class']])

The lambda attempt

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll(lambda tag: tag.name == 'a' and tag.get('class') == ['box piunavailable'])])

The CSS selector attempt

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll('a[class="box"]')])

So, for the URL provided, I am expecting the result to be a string (converted from a list) containing all available sizes. At the time of writing this question, it should be: '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '13'

Instead, I'm getting all sizes, '7.5', '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '12', '13'

Anyone have an idea how to make it work (or know an elegant solution to my issue)? Thank you in advance!

Upvotes: 2

Views: 64

Answers (2)

QHarr

Reputation: 84455

You want the CSS :not() pseudo-class selector to exclude elements that also carry the other class. I'm using bs4 4.7.1.

sizes = [item.text for item in soup.select('.box:not(.piunavailable)')]

In full:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.jimmyjazz.com/mens/footwear/jordan-retro-13--atmosphere-grey-/414571-016?color=Grey')  
soup = BeautifulSoup(r.content,'lxml')  
sizes = [item.text for item in soup.select('.box:not(.piunavailable)')]
print(sizes)
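Because the live page may change or become unavailable, the same selector can also be checked offline. The sketch below is only an illustration: the HTML fragment is a hypothetical stand-in modelled on the markup described in the question, and html.parser is used simply to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question:
# unavailable sizes carry the extra class "piunavailable".
html = """
<div class="psizeoptioncontainer">
  <a class="box piunavailable">7.5</a>
  <a href="#" class="box">8</a>
  <a href="#" class="box">8.5</a>
  <a class="box piunavailable">12</a>
  <a href="#" class="box">13</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# a.box matches every size link; :not(.piunavailable) drops the greyed-out ones.
sizes = [item.get_text(strip=True) for item in soup.select("a.box:not(.piunavailable)")]
print(sizes)
```

Prefixing the selector with a narrows the match to links, which may help if other elements on the real page also use the box class.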

Upvotes: 1

Bitto

Reputation: 8205

What you are asking for is to get the a tags that have the class box and no other classes. This can be accomplished by passing a custom function as the filter to find_all.

def my_match_function(elem):
    if isinstance(elem,Tag) and elem.name=='a' and ''.join(elem.attrs.get('class',''))=='box':
        return True

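Here ''.join(elem.attrs.get('class',''))=='box' ensures that the a tag has only the class box and no other class. BeautifulSoup stores the class attribute as a list of strings, so an element with class="box piunavailable" joins to 'boxpiunavailable', which does not equal 'box'.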
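To see why the join comparison behaves this way, it can help to inspect what BeautifulSoup actually stores for the class attribute. This is a small sketch on a made-up two-element fragment, assuming one of the HTML parsers (here html.parser):

```python
from bs4 import BeautifulSoup

# Made-up fragment: one available size and one unavailable size.
soup = BeautifulSoup(
    '<a class="box">8</a><a class="box piunavailable">7.5</a>',
    "html.parser",
)
available, unavailable = soup.find_all("a")

# With an HTML parser, class is a multi-valued attribute stored as a list.
print(available["class"])
print(unavailable["class"])

# Joining the list without a separator only yields 'box' for the single-class tag.
joined_available = "".join(available["class"])
joined_unavailable = "".join(unavailable["class"])
print(joined_available, joined_unavailable)
```

This is also why comparing tag.get('class') against ['box piunavailable'], as in the question's lambda attempt, cannot match: the parsed value is a list of two separate strings, not one string containing a space.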

Let's see this in action:

from bs4 import BeautifulSoup,Tag
html="""
<a>This is also not needed.</a>
<div class="box_wrapper">
<a id="itemcode_11398535" class="box piunavailable">7.5</a>
<a href="#" id="itemcode_11398536" class="box">8</a>
<a href="#" id="itemcode_11398537" class="box">8.5</a>
<a href="#" id="itemcode_11398538" class="box">9</a>
<a href="#" id="itemcode_11398539" class="box">9.5</a>
<a href="#" id="itemcode_11398540" class="box">10</a>
<a href="#" id="itemcode_11398541" class="box">10.5</a>
<a href="#" id="itemcode_11398542" class="box">11</a>
<a href="#" id="itemcode_11398543" class="box">11.5</a>
<a id="itemcode_11398544" class="box piunavailable">12</a>
<a href="#" id="itemcode_11398545" class="box">13</a>
</div>
"""
def my_match_function(elem):
    if isinstance(elem,Tag) and elem.name=='a' and ''.join(elem.attrs.get('class',''))=='box':
        return True
soup=BeautifulSoup(html,'html.parser')
my_list=[x.text for x in soup.find_all(my_match_function)]
print(my_list)

Outputs:

['8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '13']
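As a possible variation (not part of the answer above), the same filter can be written as a lambda that compares the class list directly against ['box'], which is closer to what the question's lambda attempt was aiming for. The fragment below is a shortened copy of the sample HTML:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup used above.
html = """
<div class="box_wrapper">
<a id="itemcode_11398535" class="box piunavailable">7.5</a>
<a href="#" id="itemcode_11398536" class="box">8</a>
<a href="#" id="itemcode_11398537" class="box">8.5</a>
<a id="itemcode_11398544" class="box piunavailable">12</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose class list is exactly ['box'].
only_box = soup.find_all(lambda tag: tag.name == "a" and tag.get("class") == ["box"])
my_list = [tag.get_text() for tag in only_box]
print(my_list)
```

Comparing against the list avoids the corner case where different class names happen to concatenate to the same string, at the cost of also rejecting tags that have box plus some harmless extra class.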

Upvotes: 1
