Reputation: 175
I'm trying to scrape shoe sizes from this URL: http://www.jimmyjazz.com/mens/footwear/jordan-retro-13--atmosphere-grey-/414571-016?color=Grey
What I'm trying to do is get only the sizes that are available, i.e. only those that aren't greyed out.
The sizes are all wrapped in <a> elements. The available sizes have the class box, and the unavailable ones have the classes box piunavailable.
I have tried using a lambda function, ifs and CSS selectors, and none seem to work. My guess is that it's because of the way my code is structured.
The way it's structured is as follows:
The if attempt
size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll('a', attrs={'class': 'box'}) if 'piunavailable' not in e.attrs['class']])
The lambda attempt
size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll(lambda tag: tag.name == 'a' and tag.get('class') == ['box piunavailable'])])
The CSS selector attempt
size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll('a[class="box"]')])
So, for the URL provided, I expect the result to be a string (converted from a list) containing all available sizes. At the time of writing this question, that should be: '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '13'
Instead, I'm getting all sizes: '7.5', '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '12', '13'
Does anyone have an idea how to make it work (or know of an elegant solution to my issue)? Thank you in advance!
Upvotes: 2
Views: 64
Reputation: 84455
You want the CSS :not pseudo-class selector to exclude the other class. I'm using bs4 4.7.1.
sizes = [item.text for item in soup.select('.box:not(.piunavailable)')]
In full:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.jimmyjazz.com/mens/footwear/jordan-retro-13--atmosphere-grey-/414571-016?color=Grey')
soup = BeautifulSoup(r.content,'lxml')
sizes = [item.text for item in soup.select('.box:not(.piunavailable)')]
print(sizes)
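Since the live page may change or be unreachable, here is a minimal offline sketch of the same selector. The sample markup is an assumption modelled on the class names mentioned in the question (psizeoptioncontainer, box, piunavailable), and the "Size guide" link is a hypothetical decoy added to show why scoping the selector to the container can help:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page structure may differ.
sample_html = """
<a class="box">Size guide</a>
<div class="psizeoptioncontainer">
  <a class="box piunavailable">7.5</a>
  <a href="#" class="box">8</a>
  <a href="#" class="box">8.5</a>
  <a class="box piunavailable">12</a>
  <a href="#" class="box">13</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Match <a> tags with class box, but without class piunavailable,
# and only inside the size container.
sizes = [item.text for item in soup.select('.psizeoptioncontainer a.box:not(.piunavailable)')]
print(sizes)  # ['8', '8.5', '13']
```

The html.parser backend is used here only so the sketch does not depend on lxml being installed.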
Upvotes: 1
Reputation: 8205
What you are asking for is to get the <a> tags with the specific class box and no other classes. This can be accomplished by passing a custom function as a filter to find_all.
def my_match_function(elem):
    if isinstance(elem,Tag) and elem.name=='a' and ''.join(elem.attrs.get('class',''))=='box':
        return True
Here ''.join(elem.attrs.get('class',''))=='box' ensures that the <a> tag has only the class box and no other class.
Let's see this in action:
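The reason the join works is that BeautifulSoup exposes a multi-valued class attribute as a list of strings, one entry per class. The comparison can therefore be illustrated with plain lists, no bs4 needed. This is only a sketch; only_box and the variable names are hypothetical:

```python
# class="box" is stored as ['box'];
# class="box piunavailable" is stored as ['box', 'piunavailable'].
available_classes = ['box']
unavailable_classes = ['box', 'piunavailable']

def only_box(classes):
    # Joining with an empty separator gives 'box' for ['box'],
    # but 'boxpiunavailable' for ['box', 'piunavailable'].
    return ''.join(classes) == 'box'

print(only_box(available_classes))    # True
print(only_box(unavailable_classes))  # False
print(only_box(''))                   # False: '' is the default when there is no class attribute
```

One caveat: a contrived class list such as ['b', 'ox'] would also join to 'box', so comparing the list directly (classes == ['box']) is a slightly stricter alternative.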
from bs4 import BeautifulSoup,Tag
html="""
<a>This is also not needed.</a>
<div class="box_wrapper">
<a id="itemcode_11398535" class="box piunavailable">7.5</a>
<a href="#" id="itemcode_11398536" class="box">8</a>
<a href="#" id="itemcode_11398537" class="box">8.5</a>
<a href="#" id="itemcode_11398538" class="box">9</a>
<a href="#" id="itemcode_11398539" class="box">9.5</a>
<a href="#" id="itemcode_11398540" class="box">10</a>
<a href="#" id="itemcode_11398541" class="box">10.5</a>
<a href="#" id="itemcode_11398542" class="box">11</a>
<a href="#" id="itemcode_11398543" class="box">11.5</a>
<a id="itemcode_11398544" class="box piunavailable">12</a>
<a href="#" id="itemcode_11398545" class="box">13</a>
</div>
"""
def my_match_function(elem):
    if isinstance(elem,Tag) and elem.name=='a' and ''.join(elem.attrs.get('class',''))=='box':
        return True
soup=BeautifulSoup(html,'html.parser')
my_list=[x.text for x in soup.find_all(my_match_function)]
print(my_list)
Outputs:
['8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '13']
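As a possible variation, the same filter can be written as a lambda that compares the class list directly. This is a sketch on a shortened copy of the sample HTML, not a drop-in replacement. It also hints at why a comparison against ['box piunavailable'] matches nothing: the parsed class value is a list with one entry per class, not a single space-separated string.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup above.
html = """
<div class="box_wrapper">
<a id="itemcode_11398535" class="box piunavailable">7.5</a>
<a href="#" id="itemcode_11398536" class="box">8</a>
<a href="#" id="itemcode_11398545" class="box">13</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# tag.get('class') returns e.g. ['box'] or ['box', 'piunavailable'],
# or None when the tag has no class attribute.
available = [
    a.get_text()
    for a in soup.find_all(lambda tag: tag.name == 'a' and tag.get('class') == ['box'])
]
print(available)  # ['8', '13']
```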
Upvotes: 1