JNutt

Reputation: 23

Can't get Scrapy to return text in Div

I'm having trouble getting Scrapy to return the text from this div. When it does return data, it's considerably more than I expected.

Target HTML:

<div class="DivTimeSpan" title="Full Time">12:00 PM - 09:00 PM </div>

Attempt 1:

    def parse_schedule(self, response):
        s_item = ScheduleItem()

        for sel in response.xpath("//div[@class='DivTimeSpan']"):
            s_item['schedule'] = sel.select('//text()').extract()
        return s_item

Returns:

"\r\n\r\n ", "\r\n ", "\r\n \r\n\r\n var allowedUrls = [];\r\n allowedUrls.push(\"Login.net\");\r\n allowedUrls.push(\"Login\");\r\n allowedUrls.push(\"AccountLogin.net\");\r\n allowedUrls.push(\"AccountLogin\");\r\n allowedUrls.push(\"CreateAccount\");\r\n allowedUrls.push(\"CreateAccount.net\");\r\n allowedUrls.push(\"UpdateAccount\");\r\n allowedUrls.push(\"UpdateAccount.net\");\r\n allowedUrls.push(\"CreateResellersAccount\");\r\n allowedUrls.push(\"CreateResellersAccount.net\");\r\n allowedUrls.push(\"CreateQqestSAASAccount\");\r\n
"11:00 AM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM"

The entire file is probably thousands of lines long and contains what looks like text from outside of the div I specified.

I understood //text() to return the text of the element and its children. The HTML element I'm targeting doesn't have any children, though, so I assumed it would only return the data in the div.

Next I tried just using "/text()". This was the only change.

Attempt 2:

    for sel in response.xpath("//div[@class='DivTimeSpan']"):
        s_item['schedule'] = sel.select('/text()').extract()
    return s_item

Returns:

[{"schedule": []}]

Desired Result:

[{"schedule": ["11:00 AM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM"]}]

The url I'm scraping is behind a company login so I can't give out the actual url.

Elisha's post led me in the right direction, Thanks!!! :) Answer:

    for sel in response.xpath("//div[@class='DivTimeSpan']"):
        s_item['schedule'] = map(unicode.strip, sel.select('//div/text()').extract())
    return s_item
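
Note that `map(unicode.strip, ...)` only works on Python 2. Below is a minimal Python 3 sketch of the same whitespace cleanup, using a hard-coded list in place of the `extract()` result. The `clean_times` name and the sample values are made up for illustration, and dropping whitespace-only strings is an extra step the snippet above does not perform:

```python
def clean_times(raw):
    """Strip surrounding whitespace and drop strings that end up empty."""
    return [s.strip() for s in raw if s.strip()]


# Stand-in for what extract() might return; not real scraped data.
raw = ["12:00 PM - 09:00 PM ", "\r\n ", " 11:00 AM - 09:00 PM"]
print(clean_times(raw))
```

A list comprehension also yields a real list on Python 3, where `map()` would return a lazy iterator instead.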

Upvotes: 1

Views: 571

Answers (1)

Elisha

Reputation: 23790

The second attempt is closer to extracting the value. However, you need to extract the text from the selected node, not from the document root:

s_item['schedule'] = sel.select('/div/text()').extract()[0]

In case the document contains more tags (which are not divs), you can try:

s_item['schedule'] = sel.select('//div/text()').extract()[0]

Upvotes: 1
