Elena ZdeG

Reputation: 1

Python: exclude a string with a regular expression

I'm trying to build a web scraper to get prices from http://fetch.co.uk/dogs/dog-food?per-page=20

I have the code below:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    print(wrap.find("",{"itemprop": re.compile("shelf-product__price.*(?!cut).*")}).get_text())
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())

In every wrap there are sometimes two different prices, and I am trying to exclude the cut price and get only the price below it (the promo price).

I cannot figure out how to exclude the price with cut; the expression above does not work. These are the two class values involved:

"shelf-product__price shelf-product__price--cut [ v2 ]"
"shelf-product__price shelf-product__price--promo [ v2 ]"

I have used the workaround below, but I'd like to understand what I am getting wrong in the regular expression. Sorry if the code is not pretty, I'm learning.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")})
for wrap in wrapList:
    print(wrap.find("",{"itemprop": re.compile("price.*")}).get_text())
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())

Upvotes: 0

Views: 246

Answers (2)

Learner

Reputation: 5292

Why use such complex code? You could try the version below. The selector span[itemprop=price] selects every span element whose itemprop attribute is price.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

#get possible list of urls
urls = ['http://fetch.co.uk/dogs/dog-food?per-page=%s'%n for n in range(1,100)]

for url in urls:
  html = urlopen(url)
  bsObj = BeautifulSoup(html,"html.parser")
  for y in [i.text for i in bsObj.select("span[itemprop=price]")]:
    print(y)

Upvotes: 0

Trevor Merrifield

Reputation: 4691

There are a few problems. The first is that .*(?!cut).* is equivalent to .*. This is because the first .* consumes all of the remaining characters. Then of course the (?!cut) check passes since it's at the end of the string. Finally .* consumes 0 characters. So it's always a match. This regex would give you false positives. The only reason it gives you nothing is that you are looking in itemprop when the text you're looking for is in class.
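
To make the point above concrete, here is a minimal sketch using only the re module and the two class strings from the question. The second pattern, with the negative lookahead anchored at the start of the string, is my own illustration of one way to express "contains no cut" and is not taken from the question. It is tested against the full attribute string only; how BeautifulSoup applies a pattern to a multi-valued class attribute is a separate matter and is not shown here.

```python
import re

# The two class strings quoted in the question.
cut = "shelf-product__price shelf-product__price--cut [ v2 ]"
promo = "shelf-product__price shelf-product__price--promo [ v2 ]"

# Pattern from the question: the first .* swallows the rest of the string,
# so (?!cut) is only ever checked at the very end, where it always passes.
original = re.compile(r"shelf-product__price.*(?!cut).*")

# Illustrative alternative (an assumption, not the asker's code): check the
# lookahead once at the start, so "cut" anywhere in the string rejects it.
anchored = re.compile(r"^(?!.*cut).*shelf-product__price")

original_hits = [bool(original.search(s)) for s in (cut, promo)]
anchored_hits = [bool(anchored.search(s)) for s in (cut, promo)]

print(original_hits)  # [True, True]
print(anchored_hits)  # [False, True]
```

The original pattern accepts both strings, which is the false positive described above, while the anchored lookahead rejects the one that contains cut.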

Your workaround looks good to me. But if you wanted to base your search on classes I would do it like this.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html,"html.parser")

wrapList = bsObj.findAll("",{"class": "shelf-product__self"})

def is_price(tag):
    return tag.has_attr('class') and \
           'shelf-product__price' in tag['class'] and \
           'shelf-product__price--cut' not in tag['class']

for wrap in wrapList:
    print(wrap.find(is_price).text)
    print(wrap.find("",{"class": "shelf-product__title"}).get_text())

Regular expressions are fine but I think it's easier to do boolean logic with booleans.
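
As a rough illustration of that last point, the same boolean test can be tried on plain lists of class names without fetching the page. BeautifulSoup exposes a tag's class attribute as a list of tokens, so splitting the class strings from the question on whitespace gives comparable input. The names is_promo_price and title_classes are hypothetical and exist only for this sketch.

```python
def is_promo_price(classes):
    """Return True for a price element that is not the cut price."""
    return ("shelf-product__price" in classes
            and "shelf-product__price--cut" not in classes)

# Class strings from the question, split into tokens the way a parsed
# multi-valued class attribute would appear.
cut_classes = "shelf-product__price shelf-product__price--cut [ v2 ]".split()
promo_classes = "shelf-product__price shelf-product__price--promo [ v2 ]".split()
title_classes = ["shelf-product__title"]  # hypothetical non-price element

results = [is_promo_price(c) for c in (cut_classes, promo_classes, title_classes)]
print(results)  # [False, True, False]
```

Because each condition is a simple membership test, adding another exclusion later is a one-line change rather than a new lookahead.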

Upvotes: 1
