Reputation: 392
I am working on a web scraping program, but I have run into a problem using Scrapy with JavaScript-generated content. I know that Scrapy is not built for this type of scraping, so I have been trying to use ScrapyJS or Splash to accomplish what I need.
However, I cannot get either of these two modules to work correctly with Scrapy. Does anyone have a minimal example that uses ScrapyJS or Splash to render JavaScript pages?
Edit: My platform is Ubuntu and I am working with Python. For ScrapyJS I just put the source in the top-level directory of the Scrapy project, and I have yet to find any real guides on how to use Splash. I am asking about Splash because it seems to be the more powerful module for JavaScript rendering and is often mentioned in the same conversations as ScrapyJS.
Upvotes: 4
Views: 2458
Reputation: 19151
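I believe all you have to do is implement process_links in your spider so that every extracted link is fetched through a running Splash instance:

    from urllib.parse import quote

    def proxy_url(url):
        # Percent-encode the target so its own query string is not mixed into Splash's.
        return "http://localhost:8050/render.html?url=%s&timeout=15&wait=1" % quote(url, safe="")

    # Spider method; reference it from a Rule with process_links="process_links".
    def process_links(self, links):
        for link in links:
            link.url = proxy_url(link.url)
        return links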
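To make the URL rewriting easier to try without a Scrapy project, here is a small standalone sketch of the same idea. It assumes Splash is listening on localhost:8050 as in the snippet above; the `Link` class is a hypothetical stand-in for Scrapy's link objects (only the `url` attribute is used), and the example.com address is purely illustrative.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a Splash instance is reachable at this address.
SPLASH_RENDER = "http://localhost:8050/render.html"


def proxy_url(url, timeout=15, wait=1):
    """Return a Splash render.html URL that fetches and renders `url`."""
    # urlencode percent-encodes the target, so its own "?" and "&" are
    # not mistaken for Splash's query parameters.
    query = urlencode({"url": url, "timeout": timeout, "wait": wait})
    return f"{SPLASH_RENDER}?{query}"


@dataclass
class Link:
    """Hypothetical stand-in for a Scrapy link object (url attribute only)."""
    url: str


def process_links(links):
    """Rewrite each extracted link so that it is requested through Splash."""
    for link in links:
        link.url = proxy_url(link.url)
    return links


if __name__ == "__main__":
    rewritten = process_links([Link("http://example.com/page?id=1&sort=asc")])
    print(rewritten[0].url)
    # Decoding the query string recovers the original target URL.
    print(parse_qs(urlsplit(rewritten[0].url).query)["url"][0])
```

In a real CrawlSpider, process_links would be a method taking self, and I would expect it to be wired in with something like Rule(LinkExtractor(), process_links="process_links"). Note that this only rewrites links extracted by rules, so the start URLs would need to be passed through proxy_url separately.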
Upvotes: 1