Reputation: 1
I'm trying to build a web scraper to get prices from http://fetch.co.uk/dogs/dog-food?per-page=20
Here is my code:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")
bsObj = BeautifulSoup(html, "html.parser")
wrapList = bsObj.findAll("", {"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    # attempt to skip the cut price with a negative lookahead (this is the part that fails)
    print(wrap.find("", {"itemprop": re.compile("shelf-product__price.*(?!cut).*")}).get_text())
    print(wrap.find("", {"class": re.compile("shelf-product__title.*")}).get_text())
Each wrap sometimes contains two different prices. I am trying to exclude the cut price and get only the price below it (the promo price).
I cannot figure out how to exclude the cut price; the expression above does not work. These are the two class values involved:
"shelf-product__price shelf-product__price--cut [ v2 ]"
"shelf-product__price shelf-product__price--promo [ v2 ]"
I have used the workaround below, but I'd like to understand what I am getting wrong in the regular expression. Sorry if the code is not pretty, I'm learning.
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")
bsObj = BeautifulSoup(html, "html.parser")
wrapList = bsObj.findAll("", {"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    # workaround: match on the itemprop attribute instead of the class
    print(wrap.find("", {"itemprop": re.compile("price.*")}).get_text())
    print(wrap.find("", {"class": re.compile("shelf-product__title.*")}).get_text())
Upvotes: 0
Views: 246
Reputation: 5292
Why use such complex code? You could try the code below. The selector span[itemprop=price] means: select every span element whose itemprop attribute is price.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# get possible list of urls
urls = ['http://fetch.co.uk/dogs/dog-food?per-page=%s' % n for n in range(1, 100)]
for url in urls:
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    # print the text of every span whose itemprop attribute is "price"
    for y in [i.text for i in bsObj.select("span[itemprop=price]")]:
        print(y)
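If you want to try the selector without hitting the site, here is a minimal offline sketch. The HTML is made up, modelled only on the class names quoted in the question, and it assumes that itemprop="price" sits on the current price and not on the cut one (which is what the question's workaround suggests). The real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names quoted in the question;
# the real page may be structured differently.
html = """
<div class="shelf-product__self">
  <span class="shelf-product__title">Dry Dog Food 2kg</span>
  <span class="shelf-product__price shelf-product__price--cut [ v2 ]">12.00</span>
  <span class="shelf-product__price shelf-product__price--promo [ v2 ]" itemprop="price">9.00</span>
</div>
<div class="shelf-product__self">
  <span class="shelf-product__title">Wet Dog Food 400g</span>
  <span class="shelf-product__price [ v2 ]" itemprop="price">1.50</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: every <span> whose itemprop attribute equals "price"
prices = [span.get_text(strip=True) for span in soup.select("span[itemprop=price]")]
print(prices)
```

With this sample markup the cut price is skipped simply because it has no itemprop attribute.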
Upvotes: 0
Reputation: 4691
There are a few problems. The first is that .*(?!cut).* is equivalent to .* on its own. This is because the first .* consumes all of the remaining characters. Then of course the (?!cut) check passes, since it's at the end of the string. Finally the second .* consumes 0 characters. So it's always a match. This regex would give you false positives. The only reason it gives you nothing is that you are looking in itemprop when the text you're looking for is in class.
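You can check this with the re module alone, using the two class strings from the question. The sketch below also shows one possible way to write the lookahead so that it really rejects strings containing cut: anchor it at the start and let it scan the whole string. The variable names and the second pattern are mine, not something from the original code.

```python
import re

# The two class attribute values quoted in the question
cut = "shelf-product__price shelf-product__price--cut [ v2 ]"
promo = "shelf-product__price shelf-product__price--promo [ v2 ]"

# Original pattern: the lookahead is only tested at the end of the string,
# after the first .* has consumed everything, so it never rejects anything.
original = re.compile(r"shelf-product__price.*(?!cut).*")

# One possible fix (an assumption, not the asker's code): test the lookahead
# once at the start, against the whole string.
fixed = re.compile(r"^(?!.*cut).*shelf-product__price")

original_hits = [bool(original.search(s)) for s in (cut, promo)]
fixed_hits = [bool(fixed.search(s)) for s in (cut, promo)]

print(original_hits)  # [True, True]
print(fixed_hits)     # [False, True]
```

One caveat: as far as I understand, BeautifulSoup tests a regex against each class token separately before trying the full attribute string, so even the anchored pattern can still match the bare shelf-product__price token on the cut element. That is another reason to filter on the class tokens directly rather than with a regex.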
Your workaround looks good to me. But if you wanted to base your search on classes I would do it like this.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html, "html.parser")
wrapList = bsObj.findAll("", {"class": "shelf-product__self"})

def is_price(tag):
    # a price element that is not the crossed-out (cut) price
    return tag.has_attr('class') and \
        'shelf-product__price' in tag['class'] and \
        'shelf-product__price--cut' not in tag['class']

for wrap in wrapList:
    print(wrap.find(is_price).text)
    print(wrap.find("", {"class": "shelf-product__title"}).get_text())
Regular expressions are fine but I think it's easier to do boolean logic with booleans.
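Here is a small offline sketch of the same idea that also copes with a product that has no price at all (find returns None in that case, and calling .text on it would raise). The HTML and the current_price helper are hypothetical, made up for illustration from the class names in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names quoted in the question.
html = """
<div class="shelf-product__self">
  <span class="shelf-product__title">Dry Dog Food 2kg</span>
  <span class="shelf-product__price shelf-product__price--cut [ v2 ]">12.00</span>
  <span class="shelf-product__price shelf-product__price--promo [ v2 ]">9.00</span>
</div>
<div class="shelf-product__self">
  <span class="shelf-product__title">Wet Dog Food 400g</span>
  <span class="shelf-product__price [ v2 ]">1.50</span>
</div>
<div class="shelf-product__self">
  <span class="shelf-product__title">Out of stock item</span>
</div>
"""

def is_price(tag):
    # True for a price element that is not the crossed-out (cut) price
    classes = tag.get("class", [])
    return ("shelf-product__price" in classes
            and "shelf-product__price--cut" not in classes)

def current_price(wrap):
    # Hypothetical helper: text of the first non-cut price, or None if absent
    tag = wrap.find(is_price)
    return tag.get_text(strip=True) if tag is not None else None

soup = BeautifulSoup(html, "html.parser")
results = [current_price(w) for w in soup.find_all(class_="shelf-product__self")]
print(results)
```

Because class is treated as a list of tokens, the membership checks compare whole class names, so shelf-product__price never accidentally matches shelf-product__price--cut.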
Upvotes: 1