Muttonchop

Reputation: 392

How to integrate javascript rendering module to scrapy?

I am working on a web scraping program, but I have run into a problem using scrapy with javascript generated content. I know that scrapy is not built to do this type of scraping, but I have been trying to use scrapyjs or splash to accomplish what I need.

However, I cannot get either of these two modules to work correctly with scrapy. Does anyone have a minimal example that uses scrapyjs or splash to render javascript pages?

Edit: My platform is Ubuntu and I am working with Python. For scrapyjs I just put the source in the uppermost directory of the scrapy project, and I have yet to find any real guides on how to use splash. The reason I am asking about splash is that it seems to be a more powerful module for javascript rendering and is mentioned a lot in the same conversations as scrapyjs.

Upvotes: 4

Views: 2458

Answers (1)

Tony

Reputation: 19151

I believe all you have to do is implement process_links in your Spider:

def proxy_url(url):
    # Route the request through a local Splash instance so the page's
    # JavaScript is executed before Scrapy sees the HTML.
    return "http://localhost:8050/render.html?url=%s&timeout=15&wait=1" % url


def process_links(self, links):
    # Rewrite every extracted link so it is fetched through the Splash proxy.
    for link in links:
        link.url = proxy_url(link.url)
    return links
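For reference, here is a minimal sketch of how this might be wired into a CrawlSpider, assuming a Splash instance is listening on localhost:8050; the spider name, start URL, and callback below are placeholders, not part of the answer above:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def proxy_url(url):
    return "http://localhost:8050/render.html?url=%s&timeout=15&wait=1" % url


class JsSpider(CrawlSpider):
    # Placeholder name; adjust for your own project.
    name = "js_spider"
    # Start from the Splash endpoint so the first page is rendered too.
    start_urls = [proxy_url("http://example.com")]

    rules = (
        # process_links rewrites every extracted link so it is fetched
        # through Splash's render.html endpoint instead of directly.
        Rule(LinkExtractor(), callback="parse_item",
             process_links="process_links", follow=True),
    )

    def process_links(self, links):
        for link in links:
            link.url = proxy_url(link.url)
        return links

    def parse_item(self, response):
        # response.body now contains the JavaScript-rendered HTML.
        yield {"url": response.url, "title": response.css("title::text").get()}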

Upvotes: 1
