Trouble with Scrapy Regex on one site likely not using normal encoding

Question

So for the 20+ sites I have this spider crawling through, all of the price items yield just fine. However, on this one specific site (https://www.garrafeiranacional.com/) there is a very annoying problem.

When I go to extract the price information from specific products, what is naturally returned without any MapCompose/Regex cleansing is something like this:

'14,55\xa0€', or, more annoyingly:
'9\xa0600,00\xa0€'

Before you ask, I have tried just about every combination I can think of. Normally I would do something like:

productLoader.add_xpath('blah', 'blah', MapCompose(lambda i: i.lstrip(punctuation)
    .strip().replace('"', '').replace('.', ','), re = '[^\d]+'))

even trying regexes like:

'\b\d[\d,.]*\b'
and countless others, both inside and outside MapCompose.

I even tried using re.sub() within the MapCompose, with a pattern like:

"(?<=\d)\S+" (a positive lookbehind, meant to capture everything after the last digit)

So I assume the issue is the encoding this website uses: another field I was scraping on the site, which had a space in it, yielded the same weird \xa0 string. I tried stripping whitespace from the price, but nothing appears to do the trick. If anyone has any ideas on where I should look, I would love to hear them.
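
For reference, \xa0 is how Python displays U+00A0 (NO-BREAK SPACE), an ordinary Unicode character that this site appears to use both before the € sign and as a thousands separator. Below is a minimal sketch of one way to clean such values. It assumes every price uses a comma as the decimal separator, and the helper name clean_price is hypothetical, not part of Scrapy:

```python
import re
from decimal import Decimal


def clean_price(raw):
    """Convert a price string such as '9\xa0600,00\xa0€' to a Decimal.

    Assumption: the comma is the decimal separator, and anything that is
    not a digit or a comma (no-break spaces, currency signs) is noise.
    """
    # Drop everything except digits and the decimal comma.
    # This also removes the U+00A0 no-break spaces.
    digits = re.sub(r'[^\d,]', '', raw)
    # Swap the decimal comma for a dot so Decimal can parse the value.
    return Decimal(digits.replace(',', '.'))


print(clean_price('14,55\xa0€'))
print(clean_price('9\xa0600,00\xa0€'))
```

A function like this could presumably be passed to the loader as MapCompose(clean_price) in place of the lambda. Sites that use a dot as the decimal separator would need different handling.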
