Reputation: 23
I'm having trouble getting Scrapy to return the text from this div. When it does return data, it's considerably more than I expected.
Target HTML:
<div class="DivTimeSpan" title="Full Time">12:00 PM - 09:00 PM </div>
Attempt 1:
def parse_schedule(self, response):
    s_item = ScheduleItem()
    for sel in response.xpath("//div[@class='DivTimeSpan']"):
        s_item['schedule'] = sel.select('//text()').extract()
    return s_item
Returns:
"\r\n\r\n ", "\r\n ", "\r\n \r\n\r\n var allowedUrls = [];\r\n allowedUrls.push(\"Login.net\");\r\n allowedUrls.push(\"Login\");\r\n allowedUrls.push(\"AccountLogin.net\");\r\n allowedUrls.push(\"AccountLogin\");\r\n allowedUrls.push(\"CreateAccount\");\r\n allowedUrls.push(\"CreateAccount.net\");\r\n allowedUrls.push(\"UpdateAccount\");\r\n allowedUrls.push(\"UpdateAccount.net\");\r\n allowedUrls.push(\"CreateResellersAccount\");\r\n allowedUrls.push(\"CreateResellersAccount.net\");\r\n allowedUrls.push(\"CreateQqestSAASAccount\");\r\n
"11:00 AM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM"
The entire output file is probably thousands of lines long and contains what looks like text from outside the div I specified.
I understood //text() to return the text of the element and its children. The HTML element I'm targeting doesn't have any children, though, so I assumed it would only return the text in the div.
Next I tried just using "/text()". This was the only change.
Attempt 2:
for sel in response.xpath("//div[@class='DivTimeSpan']"):
    s_item['schedule'] = sel.select('/text()').extract()
return s_item
Returns:
[{"schedule": []}]
Desired Result:
[{"schedule": ["11:00 AM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM"]}]
The URL I'm scraping is behind a company login, so I can't give out the actual URL.
Elisha's post led me in the right direction. Thanks!!! :) Answer:
for sel in response.xpath("//div[@class='DivTimeSpan']"):
    s_item['schedule'] = map(unicode.strip, sel.select('//div/text()').extract())
return s_item
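A note for anyone on Python 3: there is no unicode type there, and map returns a lazy iterator rather than a list. A minimal sketch of the same whitespace cleanup follows. The raw list is made up for illustration and only stands in for whatever .extract() returns.

```python
# Hypothetical sample of extracted strings, including padding and a blank entry.
raw = ["11:00 AM - 09:00 PM ", "\r\n 12:00 PM - 09:00 PM ", "\r\n  "]

# str.strip() removes surrounding whitespace; the condition drops
# strings that contained nothing but whitespace.
schedule = [s.strip() for s in raw if s.strip()]

print(schedule)
```

The list comprehension builds a real list, so the result can be stored on the item and serialized directly.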
Upvotes: 1
Views: 571
Reputation: 23790
The second attempt is closer to extracting the value. However, you need to extract the text from the node, not from the document root:
s_item['schedule'] = sel.select('/div/text()').extract()[0]
In case the document contains more tags (which are not divs), you can try:
s_item['schedule'] = sel.select('//div/text()').extract()[0]
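To illustrate the difference between searching the whole document and reading text from a matched node, here is a small sketch that uses only Python's standard library (xml.etree.ElementTree) instead of Scrapy. The PAGE markup is a made-up, well-formed stand-in for the real page, and ElementTree supports only a subset of XPath, so treat this as an analogy rather than Scrapy's exact behavior. In Scrapy itself, an XPath that starts with / or // is evaluated from the document root even when called on a selector inside the loop, while a path starting with a dot, such as ./text(), is relative to that selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real page, which is behind a login.
PAGE = """
<html>
  <head><script>var allowedUrls = [];</script></head>
  <body>
    <div class="DivTimeSpan" title="Full Time">11:00 AM - 09:00 PM </div>
    <div class="DivTimeSpan" title="Full Time">12:00 PM - 09:00 PM </div>
    <div class="Other">unrelated</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Every non-blank text node in the document: roughly what a
# document-wide //text() query collects, script contents included.
all_text = [t.strip() for t in root.itertext() if t.strip()]

# Only the text owned by each matched div: the node-relative lookup.
schedule = [
    div.text.strip()
    for div in root.findall(".//div[@class='DivTimeSpan']")
]

print(all_text)
print(schedule)
```

The first list picks up the script text and the unrelated div, which mirrors the oversized output in the question. The second contains only the schedule strings.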
Upvotes: 1