Reputation: 434
I have a broad crawler that goes through all the pages, extracts links with the LinkExtractor, and continues. However, I'd also like to scrape all the linked css and js documents, so I wrote separate rules to handle the js, css, and normal links, as shown below:
rules = [
    # rule to process most links
    Rule(
        LinkExtractor(
            canonicalize=True,
            unique=True,
        ),
        follow=True,
        callback="parse_items",
        process_links='filter_links',
    ),
    # rule to process css links
    Rule(
        LinkExtractor(
            unique=True,
            tags=['link'],
            attrs=['href'],
            process_value=process_css,
        ),
        follow=True,
        callback='parse_items',
        process_links='filter_resources',
    ),
    # rule to find js links
    Rule(
        LinkExtractor(
            unique=True,
            tags=['script'],
            attrs=['src'],
        ),
        follow=True,
        callback='parse_items',
        process_links='filter_resources',
    ),
]
The process_css function just prints whatever passes through it. With this setup I can crawl and access all the js files and their content, but not the css files. To be specific, these rules find both the css and js links without an issue, but the css links, as far as I can tell, are not being followed.
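For reference, the helpers named in those rules are just pass-through debugging stubs, roughly like this (bodies paraphrased, not copied verbatim):

def process_css(value):
    # passed as process_value= on the css extractor: print each candidate
    # attribute value and return it unchanged
    print(value)
    return value

def filter_resources(self, links):
    # spider method named in process_links=: print each extracted link,
    # then pass the full list through unfiltered
    for link in links:
        print(link.url)
    return links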
Edit: the "Response content isn't text" error was due to something else.
Upvotes: 0
Views: 249
Reputation: 360
You can do this simply by defining tags and attrs in the LinkExtractor itself.
self.link_extractor = LinkExtractor(tags=('img', 'a', 'area', 'link', 'script'), attrs=('src', 'href'), deny_extensions=set())
Make sure you set the required tags and attrs. Also set deny_extensions to an empty set (or list), as above, so that resource extensions such as .css are not filtered out by the defaults.
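A minimal sketch of how that single extractor could sit inside a CrawlSpider rule (spider name, start URL, and callback are placeholders, not taken from the question):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ResourceSpider(CrawlSpider):
    # hypothetical spider: one extractor covers pages, images, css and js
    name = 'resources'
    start_urls = ['https://example.com']

    rules = [
        Rule(
            LinkExtractor(
                tags=('img', 'a', 'area', 'link', 'script'),
                attrs=('src', 'href'),
                deny_extensions=set(),  # empty, so .css/.js/.jpg links are kept
            ),
            follow=True,
            callback='parse_items',
        ),
    ]

    def parse_items(self, response):
        # placeholder callback: record the URL of every fetched page/resource
        yield {'url': response.url}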
Upvotes: 0
Reputation: 434
The problem was due to the deny_extensions parameter of the LinkExtractor defaulting to the values given here: https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/__init__.py
That default list (IGNORED_EXTENSIONS) has css listed as an ignored extension, which is why the css links were never followed.
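One way to fix it, then, is to override deny_extensions on the rules that are meant to pick up resource links. A minimal sketch against the css rule from the question (the js rule would get the same treatment):

# css rule with the default ignored-extensions list cleared,
# so that .css links are no longer dropped by the extractor
Rule(
    LinkExtractor(
        unique=True,
        tags=['link'],
        attrs=['href'],
        deny_extensions=[],  # default includes 'css'; an empty list disables the filter
        process_value=process_css,
    ),
    follow=True,
    callback='parse_items',
    process_links='filter_resources',
),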
Upvotes: 2