Michael

Reputation: 267

Scraping Value after Euro Symbol (Scrapy-Python)

I need a selector to scrape the value after the euro symbol (\u20ac).

<Selector xpath='//*[@class="col-sm-4"]/text()' data=u'\r\n\t\t            \u20ac 30.000,00'>

I tried dozens of variations that I found here on Stack Overflow and elsewhere, but I can't get it to work.

Sites like https://regexr.com/ show me that something like this:

response.xpath('//*[@class="col-sm-4"]/text()').re('(\u20ac).\d*.\d*.\d*')

should work, but it doesn't.

EDIT: Here is an example link to the data that I would like to scrape: https://www.firmenabc.at/manfred-jungwirth-montagen_MoKY

Would appreciate help!

Michael

Upvotes: 1

Views: 369

Answers (2)

Wilfredo

Reputation: 1548

Try this:

response.xpath('//*[@class="col-sm-4"]/text()').re(u'\u20ac\s*(\d+[\d\.,]+)')
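
To check the pattern without running a spider, here is a minimal sketch using only the standard `re` module on the sample text from the question. It assumes Python 3, where `re` interprets `\u20ac` inside a raw string pattern; the variable names are made up for illustration. As far as I know, Scrapy's `.re()` likewise returns the captured group as a list of strings, but that part is not exercised here.

```python
import re

# Sample text copied from the Selector output in the question.
text = '\r\n\t\t            \u20ac 30.000,00'

# Raw string: the re module itself interprets \u20ac as the euro sign.
# \s* skips the whitespace after the symbol; the group captures the number.
pattern = re.compile(r'\u20ac\s*(\d+[\d.,]+)')

# With exactly one capture group, findall returns only the captured text.
values = pattern.findall(text)
print(values)
```

Because the pattern has a capture group, only the number comes back, not the euro sign or the leading whitespace.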

Upvotes: 0

alexisdevarennes

Reputation: 5642

Here is the regex you are looking for. If you want to match the literal text \u20ac (a backslash followed by u20ac) rather than the € character, you need to escape the backslash with another \. The variant \u20ac|\\u20ac will match both € and the literal text \u20ac:

(\u20ac|\\u20ac)\s+.\d*.\d*.\d*

A \s+ was also missing. \s matches a single whitespace character, and \s+ matches one or more of them (notice there is whitespace between \u20ac and the value, 30.000,00).

Notice, though, that this will capture only the symbol: a capture group is whatever is enclosed in parentheses, i.e. (ANYTHING BETWEEN THESE WILL BE CAPTURED).

So I believe what you want is:

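(?:\u20ac|\\u20ac)\s+(\d*.*) - Here, we're surrounding \d*.* with (), therefore capturing that value instead of the symbol. The alternation sits inside a non-capturing group (?:...) so that \s+ and the capture group apply to both alternatives; without it, | would split the whole pattern in two and a bare € would match with nothing captured.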

Repeating .\d* is redundant: the trailing .* in the pattern above already matches everything that remains on the line.
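
The difference between the patterns discussed above can be checked with a short sketch, again using the standard `re` module under the assumption of Python 3. The names `symbol_only`, `ungrouped` and `grouped` are just labels for this example. It compares a group around the symbol, an alternation without grouping, and an alternation wrapped in a non-capturing group.

```python
import re

# Sample text from the question, plus a variant holding the literal
# characters backslash-u20ac instead of the euro sign.
text = '\r\n\t\t            \u20ac 30.000,00'
escaped_text = r'\u20ac 30.000,00'

# The only group wraps the symbol, so only the symbol is returned.
symbol_only = re.compile(r'(\u20ac|\\u20ac)\s+.\d*.\d*.\d*')

# "|" has the lowest precedence: this reads as
# "a euro sign" OR "literal \u20ac, whitespace, captured value".
ungrouped = re.compile(r'\u20ac|\\u20ac\s+(\d*.*)')

# The non-capturing group (?:...) limits the alternation to the symbol.
grouped = re.compile(r'(?:\u20ac|\\u20ac)\s+(\d*.*)')

print(symbol_only.findall(text))       # the symbol only
print(ungrouped.findall(text))         # group did not take part in the match
print(grouped.findall(text))           # the value
print(grouped.findall(escaped_text))   # the value, from the escaped form
```

Note that `(\d*.*)` captures everything up to the end of the line, so any text following the number on the same line would be included as well; a narrower class such as `[\d.,]+` avoids that.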

Lastly, I recommend you play around with regex using https://www.regex101.com - It's a great tool and will save you a lot of headache.

Upvotes: 1
