sepidfekr

Reputation: 11

Why do I get empty output in Scrapy for my items?

I'm new to Python and Scrapy. I'm trying to scrape the pages behind a set of links to get the data I want, but when I generate the output, the fields I want are empty.

My items.py code is as follows:

from scrapy.item import Item, Field

class CinemaItem(Item):
    url = Field()
    name = Field()

My cinema_spider.py is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from cinema.items import CinemaItem

class CinemaSpider(CrawlSpider):
    name = "cinema"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?user=artists"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=['/\?user=profile&detailid=\d+']),'parse_cinema')]

    def parse_cinema(self, response):
        hxs = HtmlXPathSelector(response)
        cinema = CinemaItem()
        cinema['url'] = response.url
        cinema['name'] = hxs.select("//html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr/td/text()").extract()
        return cinema

When I run the following command:

scrapy crawl cinema -o scraped_data.json -t json

The output file contains:

[{"url": "http://www.example.com/?detailid=218&user=profile", "name": []},
{"url": "http://www.example.com/?detailid=322&user=profile", "name": []},
{"url": "http://www.example.com/?detailid=219&user=profile", "name": []},
{"url": "http://www.example.com/?detailid=221&user=profile", "name": []}]

As you can see, the name fields are empty, although the values do exist and I can get them when I fetch the page in the Scrapy shell. Because the values are in Persian and are probably Unicode strings, the shell output looks like this:

[u'\u0631\u06cc\u062d\u0627\u0646\u0647 \u0628\u0627\u0642\u0631\u06cc \u0628\u0627\u06cc\u06af\u06cc']

I changed the spider code as follows to change the item's encoding:

cinema['name'] = hxs.select("//html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr/td/text()").extract()[0].encode('utf-8')

But I got this error:

cinema['name'] = hxs.select("//html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr[2]/td/table/tbody/tr/td/text()").extract()[0].encode('utf-8')
exceptions.IndexError: list index out of range

Then I undid that change to the spider and, following this post, wrote my own pipelines.py to change the default value of ensure_ascii to False:

import json
import codecs

class CinemaPipeline(object):

    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on item pipelines; spider_closed is a signal name
        self.file.close()

But the resulting output file was the same, with empty name fields.

I have read almost all the posts on Stack Overflow about this issue but can't resolve it. What's the problem?

Edited:

A snippet of the HTML:

<div class="content font-fa" style="margin-top:10px;">
    <div class="content-box">
        <div class="content-text" dir="rtl" style="width:240px;min-height:200px;text-align:center"><img src='../images/others/no-photo.jpg' ></div>
        <div class="content-text" dir="rtl" style="width:450px;float:right;min-height:200px;" >
            <div class="content-row" style="text-align:right;margin-right:0px;">
                <span class="FontsFa">
                    <span align="right">
                        <strong class="font-11 "> نام/نام خانوادگی : </strong>
                    </span>
                    <span class="large-title">
                        <span class="bold font-13" style="color:#900;">ریحانه باقری بایگی
                        </span>
                    </span>
               </span>
            </div>

I want to get the text inside <span class="bold font-13" style="color:#900;">ریحانه باقری بایگی </span>

Upvotes: 0

Views: 4175

Answers (1)

Robin

Reputation: 9644

The issue is very probably that your XPath doesn't match the data you want.

hxs.select(...).extract() returns an empty list, so when you try to change the encoding you are calling hxs.select(...).extract()[0] on that empty list, which throws an IndexError.
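
To make the failure mode concrete, here is a minimal sketch in plain Python. The helper name first_or_default is hypothetical (it is not part of Scrapy); it only shows how indexing an empty extract() result fails and how a guard avoids that.

```python
def first_or_default(extracted, default=None):
    """Return the first extracted string, stripped, or `default` if nothing matched."""
    if not extracted:
        return default
    return extracted[0].strip()


# An XPath that matches nothing yields an empty list, and indexing it fails.
no_match = []
try:
    no_match[0]
    raised = False
except IndexError:
    raised = True

print(raised)
print(first_or_default(no_match))
print(first_or_default([u"\u0631\u06cc\u062d\u0627\u0646\u0647 \n"]))
```

A guard like this only hides the crash; the empty list still means the XPath matched nothing.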

How did you find that XPath? Did you test it inside your spider? Beware that the HTML shown in the browser and the HTML Scrapy receives might differ, usually because Scrapy doesn't execute JavaScript. As a general rule, you should always check that response.body is what you expect.

Also, your XPath is very fragile because it uses absolute positions: any change anywhere along the path will break the whole thing. It is usually best to rely on ids or other unique characteristics (//td[@id="foobar"]).
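
One way to check response.body is to write it to a file from inside the callback and open that file in an editor. The sketch below assumes only that the response object exposes its raw bytes as .body, as Scrapy responses do; dump_body and the file name are hypothetical, and the fake response exists only so the example runs without a crawl.

```python
import os
import tempfile
from types import SimpleNamespace


def dump_body(response, directory):
    """Save the raw bytes of a response so they can be opened and inspected."""
    path = os.path.join(directory, "response_dump.html")
    with open(path, "wb") as fh:
        fh.write(response.body)
    return path


# Stand-in for a Scrapy response (an assumption for illustration only).
fake_response = SimpleNamespace(body=b"<html><body><span>name</span></body></html>")
dump_path = dump_body(fake_response, tempfile.mkdtemp())
print(dump_path)
```

In the spider, the same call could presumably go at the top of parse_cinema, for example dump_body(response, "."), and the saved file can then be compared with what the browser shows.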
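
As a rough illustration of the difference, the sketch below runs both styles of path against a simplified, well-formed excerpt of the markup from the question. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector (an assumption made so the example runs without Scrapy), and the trimmed snippet is not the real page.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed excerpt of the markup from the question
# (the unclosed <img> tag and the inline styles are omitted).
SNIPPET = """
<div class="content-box">
  <div class="content-row">
    <span class="FontsFa">
      <span class="large-title">
        <span class="bold font-13">ریحانه باقری بایگی
        </span>
      </span>
    </span>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET)

# Position-based path in the style of the original spider: it assumes a
# table layout that is not present in this markup, so nothing matches.
absolute = root.findall("./table/tbody/tr[2]/td")

# Relative path keyed on the class attribute of the wrapper element.
relative = root.findall(".//span[@class='large-title']/span")
names = [el.text.strip() for el in relative]

print(absolute)
print(names)
```

In the spider, the corresponding expression would presumably be something like hxs.select('//span[@class="large-title"]/span/text()').extract(), but that is a guess that needs checking against the HTML Scrapy actually downloads.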

Could you provide a relevant snippet of the HTML you're trying to parse?

Upvotes: 4
