Reputation: 11
Here is the context: I am trying to retrieve several dates from a website and get them in a nested list like this:
[[day1start,month1start,year1start,day1end,month1end,year1end], [day2,month2,year2, None, None, None]]
Note: the second date is a single day, so it has no end date.
The dates are organized in the HTML code as follows:
<div class= "eventDate">
<div class="date">
13-Sep-2014
- 14-Sep-2014
</div>
<div class="date">
05-Jul-2014
</div>
<div class="date">
09-Aug-2014
</div></div>
I would like to use XPath and a regex to get exactly the following:
[
["13", "Sep", "2014", "14", "Sep", "2014"],
["05", "Jul", "2014", None, None, None],
["09", "Aug", "2014", None, None, None]
]
I am using this XPath to select the dates:
dates = response.xpath('//div[@class="eventDate"]/div[@class="date"]/p[1]/text()')
When I print dates, it returns the following:
[<Selector xpath='//div[@class="eventDate"]/div[@class="date"]/p[1]/text()' data=u'\n\t\t\t13-Sep-2014\n\t\t\t\t- 13-Sep-2014\n\t\t'>, <Selector xpath='//div[@class="eventDate"]/div[@class="date"]/p[1]/text()' data=u'\n\t\t\t05-Jul-2014\n\t\t'>, <Selector xpath='//div[@class="eventDate"]/div[@class="date"]/p[1]/text()' data=u'\n\t\t\t09-Aug-2014\n\t\t'>]
I am using this regex, which works perfectly on Rubular, Pythex and Pythonregex.com:
(\d{2})-(\w{3})-(\d{4})(?:.*?-.*?(\d{2})-(\w{3})-(\d{4}))?
I noticed the problem came from the ".*", which does not match "\n".
So I changed it to:
(\d{2})-(\w{3})-(\d{4})(?:[\s\S]*?-[\s\S]*?(\d{2})-(\w{3})-(\d{4}))?
Which gives the final code:
dates = response.xpath('//div[@class="eventDate"]/div[@class="date"]/p[1]/text()').re(r"(\d{2})-(\w{3})-(\d{4})(?:[\s\S]*?-[\s\S]*?(\d{2})-(\w{3})-(\d{4}))?")
My problem: I get the following flat list, which is not what I want:
[u'13', u'Sep', u'2014', u'13', u'Sep', u'2014', u'05', u'Jul', u'2014', u'', u'', u'', u'09', u'Aug', u'2014', u'', u'', u'']
The problem is that it is a flat list, whereas I need a nested one. The regex works perfectly on Rubular, Pythex and Pythonregex, just not in Scrapy...
1/ Please help!
I also have side questions: 2/ How can I make the dot match newlines in Scrapy as well? 3/ I noticed that the regex in Scrapy seems to be non-greedy by default. Is that true, and why?
Upvotes: 1
Views: 947
Reputation: 67968
See here http://regex101.com/r/lS5tT3/77
Your regex is working fine. You need to set the s flag, or re.DOTALL, so that the dot also captures \n.
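As a rough sketch of how this could look in plain Python, assuming the text of each date div has already been extracted into a list of strings (the `texts` list below is made up from the sample HTML, and `parse_dates` and `chunk` are hypothetical helper names, not part of Scrapy): compiling the pattern with re.DOTALL lets the plain `.` match newlines, and collecting `match.groups()` per string yields one inner list per date, with None for the missing end date. Writing `(?s)` at the very start of the pattern is the inline equivalent of re.DOTALL.

```python
import re

# Original pattern from the question; re.DOTALL makes "." match "\n" too.
DATE_RE = re.compile(
    r"(\d{2})-(\w{3})-(\d{4})(?:.*?-.*?(\d{2})-(\w{3})-(\d{4}))?",
    re.DOTALL,
)

# Assumed input: one string per <div class="date">, e.g. from .extract()
texts = [
    "\n\t\t\t13-Sep-2014\n\t\t\t\t- 14-Sep-2014\n\t\t",
    "\n\t\t\t05-Jul-2014\n\t\t",
    "\n\t\t\t09-Aug-2014\n\t\t",
]


def parse_dates(strings):
    """Return one list of six fields per string; unmatched groups are None."""
    rows = []
    for text in strings:
        match = DATE_RE.search(text)
        if match:
            rows.append(list(match.groups()))
    return rows


def chunk(flat, size=6):
    """Regroup an already-flat result into rows, turning '' into None."""
    return [
        [value or None for value in flat[i:i + size]]
        for i in range(0, len(flat), size)
    ]


print(parse_dates(texts))
```

If the flat list from the question has already been produced, `chunk(dates)` would regroup it six fields at a time instead. Note that the pattern is applied to each div's text separately here; on one long string holding several dates, a DOTALL `.*?` could reach across into the next date.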
Upvotes: 1