Ali AzG

Reputation: 1983

How to download images and files in StormCrawler?

I have crawled some image and file URLs from different web pages using StormCrawler and SOLR, and these URLs are in the status core of SOLR. Now I want to download the files from these URLs and save them on my machine. Any suggestions on how to do this in a simple and scalable way? Thank you.

Upvotes: 1

Views: 162

Answers (1)

Julien Nioche

Reputation: 4864

The crawler already downloads them! You don't need to do that again. What you do need is to decide where and how to store the content. If you were building a search engine, you'd use the SOLR or Elasticsearch indexers; if you needed to scrape a site, you'd send the extracted metadata to a database; if you wanted to archive the pages, the WARC module would let you generate archives.

Do you want the binary content of the pages, or the extracted text and metadata? If you want the former, the WARC module would be fine. Otherwise, you can always write your own indexer bolt; StdOutIndexer should be a good starting point. A sketch of such a bolt follows below.
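
For illustration, here is a minimal sketch of a bolt that writes the fetched bytes to local disk. It assumes Storm 2.x and that it is wired directly after the fetcher, whose tuples carry "url" and "content" fields (StormCrawler's defaults); the class name FileSaverBolt and the /tmp/downloads directory are placeholders, not part of StormCrawler:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    // Sketch only: saves the binary content fetched by the crawler to local files.
    public class FileSaverBolt extends BaseRichBolt {

        private OutputCollector collector;
        private Path outputDir;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context,
                OutputCollector collector) {
            this.collector = collector;
            // Placeholder location; make this configurable in a real topology
            outputDir = Paths.get("/tmp/downloads");
            try {
                Files.createDirectories(outputDir);
            } catch (IOException e) {
                throw new RuntimeException("Could not create output directory", e);
            }
        }

        @Override
        public void execute(Tuple tuple) {
            String url = tuple.getStringByField("url");
            byte[] content = tuple.getBinaryByField("content");
            try {
                // Hash the URL to get a file name free of illegal characters
                Files.write(outputDir.resolve(sha256Hex(url)), content);
                collector.ack(tuple);
            } catch (Exception e) {
                collector.fail(tuple);
            }
        }

        private static String sha256Hex(String input) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: nothing to emit downstream
        }
    }

In the topology you would then connect it to the fetcher, e.g. builder.setBolt("saver", new FileSaverBolt()).localOrShuffleGrouping("fetch"), where "saver" and "fetch" are whatever component names your topology uses. Note that writing to the local disk of each worker only scales so far; for a distributed crawl you'd point something like this at shared storage, or stick with the WARC module.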

Upvotes: 0
