Reputation: 65
I'm trying to extract some prices from the IKEA website, but the price format is pretty messy (whitespace, carriage returns, a stray comma). This is what I extracted:
39,90 €
,
I used Scrapy for this and had no problems so far, except that I would like to strip everything that is not the price (and the euro symbol)!
I tried to use this regex (in Python 2.7):
re(\S[0-9]+([ ,]?[ ])([0-9]{2}?)u"\u20AC")
I'm new to programming and only learned what a regular expression is this afternoon. I tried a large number of variations without getting anything better than:
SyntaxError: unexpected character after line continuation character
If someone could take a few minutes to look at what I did and tell me where I went wrong, that would be great!
Cheers everyone
Upvotes: 1
Views: 1329
Reputation: 2346
What type of strings are you trying to match: unicode or byte strings?
If you are working with unicode strings, your match could look like this:
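#!/usr/bin/python
import re
# Sample scraped text: the price surrounded by whitespace and a newline.
s = u""" 39,90 \u20AC
"""
# Capture the integer part, the decimal part, and the euro sign.
match = re.match(ur'\D*(\d+)\D*(\d{0,2})\D*(\u20AC)', s, re.UNICODE)
print match.groups()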
output:
(u'39', u'90', u'\u20ac')
The u prefix in front of a string literal indicates that it is a unicode string.
Regex explained:
We use \D and \d together with the re.UNICODE flag so that anything Unicode classifies as a digit or a non-digit is matched accordingly.
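To make the individual parts of the pattern easier to follow, here is a rough sketch of the same regex written with re.VERBOSE, so each piece carries its own comment. The names EURO and PRICE_RE are only illustrative and are not part of the code above. The pattern is built by concatenation (instead of a ur'' literal) so that the snippet should run on both Python 2.7 and Python 3.

```python
import re

EURO = u'\u20ac'  # euro sign as a unicode character (illustrative name)

# Same pattern as above, spread over several lines with re.VERBOSE.
PRICE_RE = re.compile(r'''
    \D*        # skip any leading non-digits (spaces, line breaks)
    (\d+)      # group 1: integer part, e.g. 39
    \D*        # separator between the two parts (the comma)
    (\d{0,2})  # group 2: up to two decimal digits, e.g. 90
    \D*        # non-digits before the currency symbol
    (''' + EURO + r''')  # group 3: the euro sign
''', re.UNICODE | re.VERBOSE)

m = PRICE_RE.match(u' 39,90 \u20ac\n')
parts = m.groups()  # three unicode strings: integer part, decimals, euro sign
```

Because \D* also accepts the euro sign itself, the engine has to backtrack by one character before group 3 can match. That is harmless here, but a stricter separator such as [,.] would be an option if the input is known to be regular.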
If you use byte strings instead (I assume UTF-8 encoded byte strings), then:
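import re
# The same sample text as UTF-8 bytes; the euro sign takes three bytes.
s = b""" 39,90 \xE2\x82\xAC
"""
# Same pattern, with the euro sign spelled out as its UTF-8 byte sequence.
match = re.match(r'\D*(\d+)\D*(\d{0,2})\D*(\xE2\x82\xAC)', s)
print match.groups()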
output:
('39', '90', '\xe2\x82\xac')
"\xe2\x82\xac" is the byte sequence e2 82 ac, which is the UTF-8 encoding of the euro sign.
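A quick way to check the relation between the euro sign and those three bytes is to encode and decode it yourself. This is just a small sanity check; the variable names are illustrative.

```python
euro = u'\u20ac'                 # the euro sign as a unicode string
encoded = euro.encode('utf-8')   # its UTF-8 byte representation
decoded = encoded.decode('utf-8')  # decoding gives the original character back
```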
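A good practice is the so-called "Unicode sandwich": decode bytes to unicode as early as possible on input, work only with unicode inside the program, and encode back to bytes as late as possible on output.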
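As a minimal sketch of the Unicode sandwich idea applied to this price (decode on input, match on unicode, encode on output): the helper extract_price and the names EURO and PRICE_RE are hypothetical, and the sketch assumes the scraped data arrives as UTF-8 bytes.

```python
import re

EURO = u'\u20ac'
# Integer part, optional decimals, then the euro sign; matched on unicode text.
PRICE_RE = re.compile(r'(\d+)\D*(\d{0,2})\D*' + EURO, re.UNICODE)

def extract_price(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: return a cleaned price as bytes, or None."""
    text = raw_bytes.decode(encoding)      # input edge: bytes -> unicode
    match = PRICE_RE.search(text)          # middle: work on unicode only
    if match is None:
        return None
    euros, cents = match.groups()
    cleaned = u'%s,%s %s' % (euros, cents, EURO)
    return cleaned.encode(encoding)        # output edge: unicode -> bytes

result = extract_price(b' 39,90 \xe2\x82\xac\n,')
```

Keeping the regex on the unicode side means the pattern does not need to know which encoding the page used; only the two edges of the function do.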
Upvotes: 1