Reputation: 434
I have a broad crawler that goes through all the pages, extracts links with the LinkExtractor, and continues. However, I'd also like to scrape all the linked css and js documents, so I wrote separate rules to handle the js, css, and normal links, as shown below:
rules = [
    # rule to process most links
    Rule(
        LinkExtractor(
            canonicalize=True,
            unique=True,
        ),
        follow=True,
        callback="parse_items",
        process_links='filter_links',
    ),
    # rule to process css links
    Rule(
        LinkExtractor(
            unique=True,
            tags=['link'],
            attrs=['href'],
            process_value=process_css,
        ),
        follow=True,
        callback='parse_items',
        process_links='filter_resources',
    ),
    # rule to find js links
    Rule(
        LinkExtractor(
            unique=True,
            tags=['script'],
            attrs=['src'],
        ),
        follow=True,
        callback='parse_items',
        process_links='filter_resources',
    ),
]
The process_css function just prints whatever passes through it. With this setup I can crawl and access all the js files and their content, but not the css files. To be specific, these rules find both the css and js links without an issue, but the css links, as far as I can tell, are not being followed.
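For reference, the helpers named in those rules are just pass-through debugging stubs, roughly like this (bodies paraphrased, not copied verbatim):

def process_css(value):
    # passed as process_value= on the css extractor: print each candidate
    # attribute value and return it unchanged
    print(value)
    return value

def filter_resources(self, links):
    # spider method named in process_links=: print each extracted link,
    # then pass the full list through unfiltered
    for link in links:
        print(link.url)
    return links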
Edit: the "Response content isn't text" error was due to something else.
Upvotes: 0
Views: 249
Reputation: 360
You can do this simply by defining tags and attrs in the LinkExtractor itself.
self.link_extractor = LinkExtractor(tags=('img', 'a', 'area', 'link', 'script'), attrs=('src', 'href'), deny_extensions=set())
Make sure you set the required tags and attrs. Also set deny_extensions to an empty set (or list), as above, so that resource extensions such as .css are not filtered out by the defaults.
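A minimal sketch of how that single extractor could sit inside a CrawlSpider rule (spider name, start URL, and callback are placeholders, not taken from the question):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ResourceSpider(CrawlSpider):
    # hypothetical spider: one extractor covers pages, images, css and js
    name = 'resources'
    start_urls = ['https://example.com']

    rules = [
        Rule(
            LinkExtractor(
                tags=('img', 'a', 'area', 'link', 'script'),
                attrs=('src', 'href'),
                deny_extensions=set(),  # empty, so .css/.js/.jpg links are kept
            ),
            follow=True,
            callback='parse_items',
        ),
    ]

    def parse_items(self, response):
        # placeholder callback: record the URL of every fetched page/resource
        yield {'url': response.url}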
Upvotes: 0
Reputation: 434
The problem was due to the deny_extensions parameter of the LinkExtractor defaulting to the values given here: https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/__init__.py
That default list (IGNORED_EXTENSIONS) has css listed as an ignored extension, which is why the css links were never followed.
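One way to fix it, then, is to override deny_extensions on the rules that are meant to pick up resource links. A minimal sketch against the css rule from the question (the js rule would get the same treatment):

# css rule with the default ignored-extensions list cleared,
# so that .css links are no longer dropped by the extractor
Rule(
    LinkExtractor(
        unique=True,
        tags=['link'],
        attrs=['href'],
        deny_extensions=[],  # default includes 'css'; an empty list disables the filter
        process_value=process_css,
    ),
    follow=True,
    callback='parse_items',
    process_links='filter_resources',
),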
Upvotes: 2