scrapy regex cannot find long dash

Question

I'm using Scrapy XPath selectors with .re() to extract data from web pages. The text is Unicode (Russian), and every string I need to extract contains a long dash (em dash, '\u2014' in Python). The problem is that my regex never matches the full string: the match gets split at the long dash, which is really inconvenient. Here are some patterns I've already tried, none of which worked:

response.xpath('some xpath goes here').re(r'[\w\s\u2014\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\x2014\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\uFFFF\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\.,—]+')
response.xpath('some xpath goes here').re(r'[\w\s\u(\w){4}\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s(\u(\d)){6}\.,]+')

Versions: Python 2.7, Scrapy 0.24.6

Answers (1)
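
One likely cause, offered as a sketch rather than a confirmed diagnosis: in Python 2.7 the re module does not interpret \uXXXX escapes itself, so inside a raw byte string such as r'[\w\s\u2014\.,]+' the class ends up containing the literal characters u, 2, 0, 1 and 4 instead of the dash. Building the pattern as a unicode string, so that the dash is a real character before re sees it, and compiling with re.UNICODE so that \w covers Cyrillic letters, should avoid that. The sample text and variable names below are made up for illustration.

```python
import re

# Hypothetical sample: Russian text with a long dash (U+2014) in the middle.
text = (u'\u041c\u043e\u0441\u043a\u0432\u0430 \u2014 '
        u'\u0441\u0442\u043e\u043b\u0438\u0446\u0430, '
        u'\u0433\u043e\u0440\u043e\u0434.')

# Non-raw unicode literal: \u2014 is turned into the dash by Python itself,
# while \\w and \\s are passed through to re as \w and \s.
with_dash = re.compile(u'[\\w\\s\u2014.,]+', re.UNICODE)

# Same character class without the dash, to reproduce the splitting.
without_dash = re.compile(u'[\\w\\s.,]+', re.UNICODE)

full = with_dash.findall(text)      # one match covering the whole string
split = without_dash.findall(text)  # two matches, broken at the dash

print(len(full), len(split))
```

Assuming .re() accepts a compiled pattern as well as a string, as the Scrapy selector documentation describes, with_dash could then be passed directly: response.xpath('some xpath goes here').re(with_dash).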
