Bobafotz

Reputation: 65

Regular expression and unicode to extract a price in python

I'm trying to extract some prices from the IKEA website, but the price format is pretty messy (whitespace, carriage returns, a stray comma in the middle of nowhere). This is what I extracted:

        39,90 €
                            ,

I used Scrapy to do this and had no problems so far, except that I would like to get rid of everything that is not the price (and the euro symbol)!

I tried to use this regex (in Python 2.7):

re(\S[0-9]+([ ,]?[ ])([0-9]{2}?)u"\u20AC")

I'm new to programming and only learned what a regular expression is this afternoon. I tried a massive number of variations without getting anything better than:

SyntaxError: unexpected character after line continuation character

If someone could take a few minutes to look at what I did and tell me where I went wrong, that would be great!

Cheers everyone

Upvotes: 1

Views: 1329

Answers (1)

Žilvinas Rudžionis

Reputation: 2346

Which type of strings are you trying to match: unicode or byte strings?

Assuming you are working with unicode strings, your match could look like this:

#!/usr/bin/python
import re

s = u"""        39,90 \u20AC
                  """
# re.match anchors at the start of the string; the leading \D* skips the whitespace
match = re.match(ur'\D*(\d+)\D*(\d{0,2})\D*(\u20AC)', s, re.UNICODE)
print match.groups()

output:

(u'39', u'90', u'\u20ac')

The u prefix in front of a string literal indicates that it is a unicode string.

Regex explained:

  1. \D* - anything that is non digit zero or more times
  2. (\d+) - one or more digits
  3. \D* - ...
  4. (\d{0,2}) - between zero and two digits
  5. \D* - ...
  6. (\u20AC) - unicode currency symbol
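
As a rough sketch of how the steps above fit together, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. This version is written for Python 3, which is an assumption on my part (the question uses Python 2.7, where you would need the ur'' prefix instead). The names PRICE_RE and extract_price are made up for illustration.

```python
import re

# Same pattern as above, spelled out piece by piece with re.VERBOSE.
PRICE_RE = re.compile(r"""
    \D*        # anything that is not a digit, zero or more times
    (\d+)      # one or more digits (whole part)
    \D*        # separator, e.g. the comma
    (\d{0,2})  # between zero and two digits (fractional part)
    \D*        # whitespace before the currency symbol
    (\u20AC)   # the euro sign
""", re.VERBOSE | re.UNICODE)


def extract_price(text):
    """Return (whole, fraction, symbol) or None if nothing matches."""
    match = PRICE_RE.match(text)
    return match.groups() if match else None


s = "        39,90 \u20AC\n                            ,\n"
print(ascii(extract_price(s)))  # ('39', '90', '\u20ac')
```

Returning None for a non-matching string avoids the AttributeError you would get from calling .groups() on a failed match.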

We use \D and \d together with the re.UNICODE flag so that anything Unicode classifies as a digit or a non-digit is matched accordingly.
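
To see what unicode-aware digit matching changes, here is a small sketch, again assuming Python 3. There, str patterns are unicode-aware by default and re.ASCII switches that off, which roughly mirrors leaving out re.UNICODE in Python 2.7. The sample string is just an illustration.

```python
import re

# Two Arabic-Indic digits (three and nine); Unicode classifies both as digits.
arabic_digits = "\u0663\u0669"

unicode_match = re.match(r"\d+", arabic_digits, re.UNICODE)
ascii_match = re.match(r"\d+", arabic_digits, re.ASCII)

print(unicode_match is not None)  # True: \d accepts any Unicode digit
print(ascii_match is None)        # True: \d is limited to 0-9
```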

If you use byte strings, I assume they are UTF-8 encoded. Then:

import re

s = b"""        39,90 \xE2\x82\xAC
                  """

# Byte pattern: the euro sign is matched as its three UTF-8 bytes
match = re.match(r'\D*(\d+)\D*(\d{0,2})\D*(\xE2\x82\xAC)', s)
print match.groups()

output:

('39', '90', '\xe2\x82\xac')

"\xe2\x82\xac" is the byte sequence e2 82 ac, which is the UTF-8 encoding of the euro sign.

A good practice is the so-called "Unicode sandwich":

  1. Decode bytes to unicode on input
  2. Work only with unicode
  3. Encode unicode to bytes on output
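
The three steps above could look roughly like this. It is a sketch for Python 3 (my assumption), with the input bytes assumed to be UTF-8; the variable names are only illustrative.

```python
import re

# Bytes as they might arrive from the scraper (assumed to be UTF-8).
raw = b"        39,90 \xe2\x82\xac\n                            ,\n"

# 1. Decode bytes to unicode on input.
text = raw.decode("utf-8")

# 2. Work only with unicode: match, then rebuild a clean price string.
match = re.match(r"\D*(\d+)\D*(\d{0,2})\D*(\u20AC)", text, re.UNICODE)
price = "{0},{1} {2}".format(*match.groups()) if match else ""

# 3. Encode unicode to bytes on output.
out = price.encode("utf-8")
print(repr(out))  # b'39,90 \xe2\x82\xac'
```

Keeping the regex on the unicode side means the pattern can refer to the euro sign as one character instead of three bytes.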

Upvotes: 1
