Demandar

Reputation: 61

Scrapy: Can't strip unicode from my item data (price)

I'm building a scraper to get the product prices from a website.

At the moment I have the following code:

def parse(self, response):
    for tank in response.xpath('//html/body/div/div[4]/div/div/div/table[1]/tr/td/div/span/span'):
        item = VapeItem()
        item["price"] = tank.xpath("text()").extract()
        yield item

And here is the json output:

{"price": ["5,00 \u20ac\n  \n    \n  \n  \n  \n      *\n    \n  \n  \n    "]},

I've tried encode("utf-8"), strip and replace, but nothing seems to work.

My question is: how do I make that output readable? Either "5.00 €" (with \u20ac shown as the euro sign) or just "5.00" would be fine.

Thanks in advance!

Upvotes: 4

Views: 793

Answers (1)

Padraic Cunningham

Reputation: 180481

The simplest way may be to split once on whitespace and replace the comma with a decimal point:

item["price"] = tank.xpath("text()").extract()[0].split(None,1)[0].replace(",",".")

That will leave you with 5.00. Because there is a * in the string, a plain strip() would not work. You could pass that character to strip as well, i.e. [0].rstrip("\n* "), but that would break if there were other errant characters.

If you want the euro sign too, you can decode('unicode-escape'). This applies to a Python 2 byte string that contains the literal \u20ac escape:
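As a rough sketch of the split-once approach, assuming Python 3 and a hypothetical helper named clean_price (not part of Scrapy), the same steps could be wrapped in a function:

```python
def clean_price(raw):
    """Return the numeric part of a scraped price string, e.g. '5.00'."""
    # split(None, 1) splits on the first run of whitespace only, so
    # everything after the amount (currency sign, newlines, '*') is dropped.
    # Note: an empty or whitespace-only string would raise IndexError here.
    amount = raw.split(None, 1)[0]
    # Swap the decimal comma for a decimal point.
    return amount.replace(",", ".")


raw = "5,00 \u20ac\n  \n    \n  \n  \n  \n      *\n    \n  \n  \n    "
print(clean_price(raw))
```

In the spider this would be called on the first extracted string, for example clean_price(tank.xpath("text()").extract()[0]).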

d={"price": ["5,00 \u20ac\n  \n    \n  \n  \n  \n      *\n    \n  \n  \n    "]}

d["price"] = d["price"][0].decode('unicode-escape').rstrip("\n * ").replace(",",".")

print(d["price"])
5.00 €
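On Python 3 (an assumption; the snippet above is Python 2), str has no decode method, and a "\u20ac" in a string literal is already a real euro sign. The escape only shows up when the value is serialized as ASCII-only JSON. A minimal sketch of that, using only the standard json module:

```python
import json

# On Python 3 this literal already contains the euro sign character.
raw = "5,00 \u20ac\n  \n    \n  \n  \n  \n      *\n    \n  \n  \n    "

# Strip trailing newlines, spaces and the '*', then fix the decimal comma.
price = raw.rstrip("\n* ").replace(",", ".")

# Default JSON output escapes non-ASCII characters as \uXXXX ...
escaped = json.dumps({"price": price})
# ... while ensure_ascii=False keeps them readable.
readable = json.dumps({"price": price}, ensure_ascii=False)

print(escaped)
print(readable)
```

So the value itself may already be fine, and only the JSON export is escaping it. In Scrapy, the FEED_EXPORT_ENCODING setting (set to "utf-8") should play a similar role for exported feeds, if your version supports it.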

If you want to combine it with split and keep the sign, while also formatting it a bit more nicely (starting again from the original d):

p,s,_ = d["price"][0].split(None, 2)

d["price"] = u"{}{}".format(s.decode("unicode_escape"),p.replace(",","."))

print(d["price"])

Which will give you:

€5.00
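For reference, a sketch of the same split-and-format idea on Python 3, where no decode step is needed. The function name format_price is hypothetical, and the code assumes the text always looks like "<amount> <symbol> <anything else>":

```python
def format_price(raw):
    """Turn '5,00 € ... *' into '€5.00' (symbol first, decimal point)."""
    # Split on whitespace at most twice: amount, symbol, and the leftover junk.
    parts = raw.split(None, 2)
    if len(parts) < 2:
        # Assumption: a price without a currency symbol is treated as an error.
        raise ValueError("expected '<amount> <symbol>', got %r" % raw)
    amount, symbol = parts[0], parts[1]
    return "{}{}".format(symbol, amount.replace(",", "."))


raw = "5,00 \u20ac\n  \n    \n  \n  \n  \n      *\n    \n  \n  \n    "
print(format_price(raw))
```

Raising on unexpected input is a design choice for this sketch; returning the raw string unchanged would be an equally reasonable fallback in a scraper.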

Upvotes: 2
