Zheng

Reputation: 93

Some questions about using multiple pipelines in Scrapy

I'm new to Scrapy and started a simple project a few days ago. I have successfully implemented items.py, my_spider.py and pipelines.py to scrape some information into a JSON file. Now I'd like to add some features to my spider and have run into some questions.

I have already scraped the desired information from the threads of a forum, including the file_urls and image_urls. I'm a little confused by the tutorial in the Scrapy documentation; here are the relevant parts of my files:

**settings.py**
...
ITEM_PIPELINES = {
    'my_project.pipelines.InfoPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = './Downloads'
IMAGES_STORE = './Downloads'

**items.py**
...
class InfoIterm(scrapy.Item):
    movie_number_title = scrapy.Field()
    movie_pics_links = scrapy.Field()
    magnet_link = scrapy.Field()
    torrent_link = scrapy.Field()
    torrent_name = scrapy.Field()


class TorrentItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()


class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

**pipelines.py**
...
def process_item(self, item, spider):
    contents = json.dumps(dict(item), indent=4, sort_keys=True, ensure_ascii=False)
    with open("./threads_data.json", "wb") as f:
        f.write(contents.encode("utf-8"))
    return item

**my_spider.py**
...
def parse_thread(self, response):
    json_item = InfoIterm()
    json_item['movie_number_title'] = response.xpath("//span[@id='thread_subject']/text()").getall()
    json_item['movie_pics_links'] = response.xpath("//td[@class='t_f']//img/@file").getall()
    json_item['magnet_link'] = response.xpath("//div[@class='blockcode']/div//li/text()").getall()
    json_item['torrent_name'] = response.xpath("//p[@class='attnm']/a/text()").getall()
    json_item['torrent_link'] = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()[0]
    yield json_item

    torrent_link = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()
    yield {'file_urls': torrent_link}

    movie_pics_links = response.xpath("//td[@class='t_f']//img/@file").getall()
    yield {'image_urls': movie_pics_links}

Now the images download successfully, but the files do not. My JSON file is also overwritten by the last image_urls item.

So, here are my questions:

  1. Can one spider use multiple pipelines? If so, what's the best way to use them (an example for my case would be great)?
  2. On some threads, some of these json_item['xxx'] fields are not present, and the console prints messages reporting the problem. I tried wrapping each of these lines in try-except, but that gets really ugly and I believe there must be a better way. What is the best way to handle this?

Thanks a lot.

Upvotes: 0

Views: 61

Answers (1)

renatodvc

Reputation: 2564

1 - Yes, you can use several pipelines; you just need to mind the order in which they are called. (More on that here)
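For instance, a settings.py along these lines registers all three pipelines with distinct priorities, so the order they run in is explicit (lower numbers run first; the exact values here are only illustrative):

ITEM_PIPELINES = {
    'my_project.pipelines.InfoPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 400,
    'scrapy.pipelines.images.ImagesPipeline': 500,
}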

If they are meant to process different Item classes, all you need to do is check the class of the item received in the process_item method: process the ones you want and return the others untouched.
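A minimal sketch of that idea, reusing the InfoIterm class and my_project module from your question (the append-mode file handling is just one way to avoid overwriting earlier items, not part of the answer itself):

import json

from my_project.items import InfoIterm


class InfoPipeline:
    def process_item(self, item, spider):
        # Only handle InfoIterm instances; let every other item type
        # (e.g. the file_urls/image_urls dicts) pass through untouched.
        if not isinstance(item, InfoIterm):
            return item
        contents = json.dumps(dict(item), indent=4, sort_keys=True, ensure_ascii=False)
        # Append rather than overwrite, so each scraped item is kept.
        with open("./threads_data.json", "a", encoding="utf-8") as f:
            f.write(contents + "\n")
        return item

The built-in FilesPipeline and ImagesPipeline simply pass items through when they have no file_urls/image_urls field, so only your own pipeline needs this check.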

2 - What is the error? It's hard to help without that information; please post an execution log.

Upvotes: 1
