C.Acarbay

Reputation: 434

Scrapy can't scrape linked .css files

I have a broad crawler that goes through all the pages, extracts links with the link extractor, and continues. However, I'd also like to scrape all the linked CSS and JS documents, so I wrote separate rules to handle the JS, CSS, and normal links, as shown below:

rules = [
        # rule to process most links
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True,
            ),
            follow=True,
            callback="parse_items",
            process_links='filter_links',
        ),
        # rule to process css links
        Rule(
            LinkExtractor(
                unique=True,
                tags=['link'],
                attrs=['href'],
                process_value=process_css
            ),
            follow=True,
            callback='parse_items',
            process_links='filter_resources'
        ),
        # Rule to find js links
        Rule(
            LinkExtractor(
                unique=True,
                tags=['script'],
                attrs=['src'],
            ),
            follow=True,
            callback='parse_items',
            process_links='filter_resources'
        ),
    ]

The process_css function just prints whatever passes through it. With this setup, I can crawl and access all the JS files and their content, but not the CSS files. To be specific, these rules find both the CSS and JS links without an issue (process_css prints the CSS URLs), but the CSS links do not appear to be followed: parse_items is never called for them.

Edit: the "Response content isn't text" error was due to something else.

Upvotes: 0

Views: 249

Answers (2)

Sam

Reputation: 360

You can do this simply by defining tags and attrs in the LinkExtractor itself.

self.link_extractor = LinkExtractor(tags=('img', 'a', 'area', 'link', 'script'), attrs=('src', 'href'), deny_extensions=set())

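Make sure you set the required tags and attrs. Also set deny_extensions to an empty collection (an empty set in the line above), so that none of the default ignored extensions, css included, are filtered out.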

Upvotes: 0

C.Acarbay

Reputation: 434

The problem was due to the deny_extensions parameter of LinkExtractor, which defaults to the IGNORED_EXTENSIONS list defined here: https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/__init__.py

They have css listed as a no-go extension.

Upvotes: 2
