Reputation: 77
I have deployed some Scrapy spiders to scrape data which I can download in .csv from ScrapingHub.
Some of these spiders use a FilesPipeline, which I use to download files (PDFs) to a specific folder. Is there any way I can retrieve these files from ScrapingHub, via the platform or the API?
Upvotes: 0
Views: 763
Reputation: 675
Though I'd have to go back over ScrapingHub's documentation, I'm quite certain that despite the platform having a file explorer, no actual files are generated during the crawl, or they are simply ignored... I assume so given that if you try to deploy a project with anything other than the files that make up a Scrapy project, it won't work unless you do some hacking around with your settings and setup file to get ScrapingHub to accept the extra files... For example, if you keep a ton of start URLs in a file and use a read function to parse them into your spider, it works like a charm locally, but ScrapingHub wasn't built with that in mind...
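For what it's worth, the workaround I've used for that start-URL case is to bundle the extra file into the deployed egg through setup.py. A minimal sketch, assuming a package named myproject and a file at myproject/resources/start_urls.txt (both names are mine, not from the question):

```python
# setup.py -- bundle a non-code file into the shub deploy
# ("myproject" and the resource path are hypothetical names)
from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={'myproject': ['resources/start_urls.txt']},
    entry_points={'scrapy': ['settings = myproject.settings']},
)

# myproject/spiders/example.py -- read the bundled file at crawl time
import pkgutil
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        data = pkgutil.get_data('myproject', 'resources/start_urls.txt')
        for url in data.decode('utf-8').splitlines():
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}  # placeholder parse
```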
I assume you know you can download your items as CSV, or whatever format you prefer, straight from the web interface... Personally, I use the ScrapingHub API client for Python (python-scrapinghub)... I believe all three of its client libraries are deprecated at this point, so you kind of have to mix and match to get something fully functional. For example...
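Here's a minimal sketch of connecting with the python-scrapinghub client (pip install scrapinghub) and walking a project's finished jobs; the API key and project ID below are placeholders:

```python
# a minimal sketch with python-scrapinghub; key and project id are placeholders
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('YOUR_API_KEY')
project = client.get_project(12345)

# iterate over finished jobs and pull their scraped items
for job in project.jobs.iter(state='finished'):
    for item in client.get_job(job['key']).items.iter():
        print(item)
```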
I have a side gig for a pretty well-known porn website; what I do for them is content aggregation. I spend a lot of time watching a lot of debauchery, but for people like me it's just fun... hope you're reading this and don't think I'm too much of a pervert, LOL, got to make that money, right? Anyway... by using the ScrapingHub API client for Python, I'm able to connect to my account with the API key and maneuver my way around and do as I please. Personally, I think there are some limitations; not so much a limitation, really, just one thing that bothers me: the function to get the name of a project was deprecated in the first version of their client library... I'd like to see, while I'm parsing my items, the name of the project in which the spider is running its different jobs, i.e. the crawls... So when I first started to mess around with the client it just looked messy.
What's even more awesome is that once you create a project, run your spider, and all your items are collected, you can directly download those items from the web interface as I mentioned, but you can also target your output to get a desired effect. For example:
Say I'm crawling a site and getting media items like videos; there are three things you always need: the name of the media or the title of the video, the source URL where the video can be reached or where it is embedded (which you can then request for every instance you need), and of course the metadata, i.e. the tags and categories associated with the video.
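In Scrapy terms, the item shape I'm describing would look something like this (the field names are just my own convention, not anything the platform requires):

```python
# a sketch of the three-field item shape described above
import scrapy

class VideoItem(scrapy.Item):
    title = scrapy.Field()  # name/title of the media
    url = scrapy.Field()    # source or embed URL, requestable on demand
    tags = scrapy.Field()   # tags/categories metadata
```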
The largest crawl, the one that output the most items, was I believe around 150,000 items; it was a broad crawl, and something like 15 or 17% of them were duplicate cases. I then fetch each video through the API client by its given key value (it's not a dictionary, by the way)... Of course, in my case I always use all three key values, but I can also target the categories or tags stored under their corresponding key and still output the items in their totality (meaning all three fields), printing only the ones that match a particular string or expression I want, which lets me parse through my content quite effectively. In this particular Scrapy project, I'm simply printing out, or rather building, an .m3u playlist from all this 'pronz'!
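Here's a minimal sketch of that filtering step, reusing the same assumed field names and a hypothetical job key: it keeps only the items whose tags match a given string and writes them out as an .m3u playlist:

```python
# filter a job's items by tag and emit an .m3u playlist
# (job key, field names, and the tag string are all placeholders)
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('YOUR_API_KEY')
job = client.get_job('12345/1/7')

with open('playlist.m3u', 'w') as fh:
    fh.write('#EXTM3U\n')
    for item in job.items.iter():
        if any('some-tag' in tag for tag in item.get('tags', [])):
            fh.write('#EXTINF:-1,{}\n{}\n'.format(item['title'], item['url']))
```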
Upvotes: 1