Reputation: 267

Scraping Value after Euro Symbol (Scrapy-Python)

I need a selector to scrape the value after the euro symbol (\u20ac).

<Selector xpath='//*[@class="col-sm-4"]/text()' data=u'\r\n\t\t            \u20ac 30.000,00'>

I tried dozens of variations that I found here on Stack Overflow and elsewhere, but I can't get it to work.

Sites like https://regexr.com/ show me that something like this:

response.xpath('//*[@class="col-sm-4"]/text()').re('(\u20ac).\d*.\d*.\d*')

should work, but it doesn't.

EDIT: Here is an example link to the data that I would like to scrape: https://www.firmenabc.at/manfred-jungwirth-montagen_MoKY

Would appreciate help!

Michael

Upvotes: 1

Answers (2)

Wilfredo

Reputation: 1548

Try this:

response.xpath('//*[@class="col-sm-4"]/text()').re(u'\u20ac\s*(\d+[\d\.,]+)')
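To check a pattern like this without running a spider, here is a minimal sketch that uses only Python's re module on the text node shown in the question. The variable names are made up for illustration, and the pattern is a slightly simplified variant of the one above, written as a raw string so that \s and \d reach the regex engine unchanged:

```python
import re

# Text node copied from the Selector output in the question.
text = '\r\n\t\t            \u20ac 30.000,00'

# Euro sign, optional whitespace, then capture a digit followed by
# any run of digits, dots and commas.
pattern = re.compile(r'\u20ac\s*(\d[\d.,]*)')

values = pattern.findall(text)
print(values)  # ['30.000,00']
```

In Scrapy the same pattern should be usable with .re() or .re_first() on the selector, since both return the contents of the capture group when the pattern has one.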

Upvotes: 0

alexisdevarennes

Reputation: 5642

Here is the regex you are looking for. If you want to match the text \u20ac literally, you need to escape the backslash with another \. The variant \u20ac|\\u20ac matches both € and the literal text \u20ac:

(\u20ac|\\u20ac)\s+.\d*.\d*.\d*

A \s+ was also missing. \s matches a single whitespace character and \s+ matches one or more (notice there is whitespace between \u20ac and the value, 30.000,00).

Notice though that this will capture only the € symbol, because a capture group is whatever sits between a pair of parentheses: (ANYTHING BETWEEN THESE WILL BE CAPTURED).

So I believe what you want is:

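(?:\u20ac|\\u20ac)\s+(\d*.*) - Here, the parentheses surround \d*.* instead of the € symbol, so the value is captured. The (?:...) around the alternation is a non-capturing group; without it, | would split the whole pattern and the left branch would match the € symbol alone.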

Repeating .\d* is redundant in this pattern: \d* matches the leading digits and the trailing .* matches the rest of the line, separators included.
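The effect of where the parentheses sit can be seen in a small sketch using Python's re module. The variable names are hypothetical, and re.findall is used here as a stand-in for Scrapy's .re(), on the assumption that both return the capture group when one is present:

```python
import re

text = '\u20ac 30.000,00'

# Group around the symbol: only the symbol comes back.
symbol_only = re.findall(r'(\u20ac)\s+\d[\d.,]*', text)

# Group around the number: only the number comes back.
number_only = re.findall(r'\u20ac\s+(\d[\d.,]*)', text)

# Ungrouped alternation: '|' splits the whole pattern, so the left branch
# matches the bare symbol and the group on the right stays empty.
ungrouped = re.findall(r'\u20ac|\\u20ac\s+(\d[\d.,]*)', text)

# A non-capturing group (?:...) limits the alternation to the symbol.
grouped = re.findall(r'(?:\u20ac|\\u20ac)\s+(\d[\d.,]*)', text)

print(symbol_only, number_only, ungrouped, grouped)
```

Only the second and fourth patterns return the amount; the third returns a single empty string because the branch that matched contains no group.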

Lastly, I recommend you play around with regex using https://www.regex101.com - It's a great tool and will save you a lot of headache.

Upvotes: 1
