user3702643

Reputation: 1495

Crawling redirected urls with scrapy

I'm trying to use Scrapy to crawl www.mywebsite.com.

www.mywebsite.com is hosted on a free host with the url www.mywebsite.freehost.com. I am redirecting the free host to my paid domain.

The problem is that Scrapy ignores the redirect, so 0 pages are scraped.

How do I tell Scrapy to crawl the redirected URL? I only need it to crawl the redirected URL, not other URLs that lead out of the website (Facebook pages, etc.).

2016-11-27 14:48:42 [scrapy] INFO: Spider opened
2016-11-27 14:48:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-27 14:48:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-27 14:48:44 [scrapy] DEBUG: Crawled (200) <GET http://www.mywebsite.com/> (referer: None)
2016-11-27 14:48:44 [scrapy] DEBUG: Filtered offsite request to 'www.mywebsite.freehost.net': <GET www.mywebsite.freehost.net>
2016-11-27 14:48:44 [scrapy] INFO: Closing spider (finished)
2016-11-27 14:48:44 [scrapy] INFO: Dumping Scrapy stats:

Upvotes: 1

Views: 208

Answers (1)

eLRuLL

Reputation: 18799

The logs show that your request is being filtered:

DEBUG: Filtered offsite request to 'www.mywebsite.freehost.net': <GET www.mywebsite.freehost.net>

Add that domain (freehost.net) to your spider's allowed_domains list, or remove allowed_domains from your spider entirely to allow every domain.
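In the spider itself, the change would look something like `allowed_domains = ["mywebsite.com", "freehost.net"]`. To see why that stops the request from being filtered, here is a rough standard-library sketch of the kind of host check the offsite filter performs. This is an approximation for illustration, not Scrapy's actual code: the helper names `build_host_regex` and `is_onsite` are made up, and the `http://` scheme on the URL is an assumption, since the log line shows none.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Approximate the host pattern built from a spider's allowed_domains."""
    if not allowed_domains:
        # With no allowed_domains, every host is accepted.
        return re.compile("")
    domains = "|".join(re.escape(d) for d in allowed_domains)
    # Accept each listed domain and any of its subdomains.
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def is_onsite(url, allowed_domains):
    """Return True if the URL's host would pass the allowed_domains check."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


# Hypothetical "before" and "after" values for the spider attribute.
before = ["mywebsite.com"]
after = ["mywebsite.com", "freehost.net"]

url = "http://www.mywebsite.freehost.net/"  # scheme assumed for the example
print(is_onsite(url, before))  # False: filtered as offsite
print(is_onsite(url, after))   # True: allowed once freehost.net is listed
```

Listing freehost.net allows every site on that host. If you only want your own pages, a narrower entry such as mywebsite.freehost.net should also match the URL from the log.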

Upvotes: 1
