Reputation: 125
I am trying to get Scrapy to parse the links on a page so it can scrape them. Unfortunately, the links on this page are wrapped in a JavaScript onclick function. I would like to use an SgmlLinkExtractor rule to extract the link, parse the JavaScript to build the URL, and pass it to callback='parse_item' if possible.
Here is an example of each link with the JS function:
<a onclick="window.open('page.asp?ProductID=3679','productwin','width=700,height=475,scrollbars,resizable,status');" href="#internalpagelink">Link Text</a>
I just need the link extractor to send http://domain.com/page.asp?ProductID=3679 to the parse_item callback.
How would I write CrawlSpider rules to do this?
If this is not possible, what would be the best way to parse all of the pages reachable through this style of JavaScript link from a defined set of start pages?
Thank you all.
Upvotes: 3
Views: 5333
Reputation: 20748
You can use the attrs parameter of SgmlLinkExtractor:
- attrs (list) – list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
and the process_value parameter from BaseSgmlLinkExtractor:
- process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
So you would write a parsing function for "onclick" attributes' values:
import re

def process_onclick(value):
    # Extract the first argument passed to window.open(...)
    m = re.search("window.open\('(.+?)'", value)
    if m:
        return m.group(1)
Let's check that regular expression:
>>> re.search("window.open\('(.+?)'",
... "window.open('page.asp?ProductID=3679','productwin','width=700,height=475,scrollbars,resizable,status');"
... ).group(1)
'page.asp?ProductID=3679'
>>>
And then use it in a Rule with SgmlLinkExtractor:
rules = (
    Rule(SgmlLinkExtractor(allow=(),
                           attrs=('onclick',),
                           process_value=process_onclick),
         callback='parse_item'),
)
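For context, here is a minimal sketch of how the pieces fit together in a complete CrawlSpider. It assumes an older Scrapy release where SgmlLinkExtractor is available; the spider name, allowed domain, and start URL are placeholders. The extractor resolves relative values like page.asp?ProductID=... against the page URL, so parse_item should receive absolute URLs.

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


def process_onclick(value):
    # Pull the first window.open(...) argument out of the onclick value
    m = re.search("window.open\('(.+?)'", value)
    if m:
        return m.group(1)


class ProductSpider(CrawlSpider):
    name = "products"                      # placeholder name
    allowed_domains = ["domain.com"]       # placeholder domain
    start_urls = ["http://domain.com/"]    # placeholder start page

    rules = (
        Rule(SgmlLinkExtractor(attrs=('onclick',),
                               process_value=process_onclick),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # The link extractor has already resolved the relative URL,
        # so response.url is absolute here.
        self.log("Scraping %s" % response.url)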
Upvotes: 6
Reputation: 9816
Maybe BaseSpider is more appropriate than CrawlSpider here. You could extract the link you want to crawl, encapsulate it in a Request object, and then yield that Request object, like the following:
def parse_xxx(self, response):
    """do some work"""
    req_objs = []
    req_objs.append(Request(SOME_URL, SOME_ARGS))
    # Add more `Request` objects here
    for req in req_objs:
        yield req
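Applied to the onclick links from the question, that approach might look like the following minimal sketch. It assumes the older Scrapy API (BaseSpider, HtmlXPathSelector) and Python 2's urlparse module; the spider name and start URL are placeholders.

import re
import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class ProductLinkSpider(BaseSpider):
    name = "productlinks"                  # placeholder name
    start_urls = ["http://domain.com/"]    # placeholder start page

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Read each anchor's onclick attribute and pull out the first
        # argument passed to window.open(...)
        for onclick in hxs.select("//a/@onclick").extract():
            m = re.search("window.open\('(.+?)'", onclick)
            if m:
                url = urlparse.urljoin(response.url, m.group(1))
                yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        """do some work"""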
Upvotes: 0