Zheng

Reputation: 93

Some questions about using multiple pipelines in Scrapy

I'm new to Scrapy and started a simple project a few days ago. I have successfully implemented items.py, my_spider.py and pipelines.py to scrape some information into a JSON file. Now I'd like to add some features to my spider and have run into some questions.

I have already scraped the desired information from the threads of a forum, including the file_urls and image_urls. I'm a little confused by the tutorial in the Scrapy documentation; here are the relevant parts of my files:

**settings.py**
...
ITEM_PIPELINES = {
    'my_project.pipelines.InfoPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = './Downloads'
IMAGES_STORE = './Downloads'

**items.py**
...
class InfoIterm(scrapy.Item):
    movie_number_title = scrapy.Field()
    movie_pics_links = scrapy.Field()
    magnet_link = scrapy.Field()
    torrent_link = scrapy.Field()
    torrent_name = scrapy.Field()


class TorrentItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()


class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

**pipelines.py**
...
def process_item(self, item, spider):
    contents = json.dumps(dict(item), indent=4, sort_keys=True, ensure_ascii=False)
    with open("./threads_data.json", "wb") as f:
        f.write(contents.encode("utf-8"))
    return item

**my_spider.py**
...
def parse_thread(self, response):
    json_item = InfoIterm()
    json_item['movie_number_title'] = response.xpath("//span[@id='thread_subject']/text()").getall()
    json_item['movie_pics_links'] = response.xpath("//td[@class='t_f']//img/@file").getall()
    json_item['magnet_link'] = response.xpath("//div[@class='blockcode']/div//li/text()").getall()
    json_item['torrent_name'] = response.xpath("//p[@class='attnm']/a/text()").getall()
    json_item['torrent_link'] = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()[0]
    yield json_item

    torrent_link = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()
    yield {'file_urls': torrent_link}

    movie_pics_links = response.xpath("//td[@class='t_f']//img/@file").getall()
    yield {'image_urls': movie_pics_links}

Now the images download successfully, but the files do not. My JSON file is also overwritten by the last image_urls item.

So, here are my questions:

  1. Can one spider use multiple pipelines? If so, what's the best way to use them (an example for my case would be great)?
  2. On some threads, some of these json_item['xxx'] fields are not present, and the console prints messages reporting the problem. I tried wrapping each of these lines in try-except, but that gets really ugly and I believe there must be a better way. What is the best way to handle this?

Thanks a lot.

Upvotes: 0

Views: 61

Answers (1)

renatodvc

Reputation: 2564

1 - Yes, you can use several pipelines; you just need to mind the order in which they are called. (More on that here)
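For instance, a settings.py along these lines registers all three pipelines with distinct priorities, so the order they run in is explicit (lower numbers run first; the exact values here are only illustrative):

ITEM_PIPELINES = {
    'my_project.pipelines.InfoPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 400,
    'scrapy.pipelines.images.ImagesPipeline': 500,
}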

If they are meant to process different Item classes, all you need to do is check the class of the item received in the process_item method: process the ones you want and return the others untouched.
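A minimal sketch of that idea, reusing the InfoIterm class and my_project module from your question (the append-mode file handling is just one way to avoid overwriting earlier items, not part of the answer itself):

import json

from my_project.items import InfoIterm


class InfoPipeline:
    def process_item(self, item, spider):
        # Only handle InfoIterm instances; let every other item type
        # (e.g. the file_urls/image_urls dicts) pass through untouched.
        if not isinstance(item, InfoIterm):
            return item
        contents = json.dumps(dict(item), indent=4, sort_keys=True, ensure_ascii=False)
        # Append rather than overwrite, so each scraped item is kept.
        with open("./threads_data.json", "a", encoding="utf-8") as f:
            f.write(contents + "\n")
        return item

The built-in FilesPipeline and ImagesPipeline simply pass items through when they have no file_urls/image_urls field, so only your own pipeline needs this check.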

2 - What is the error? It's hard to help without that information; please post an execution log.

Upvotes: 1
