Reputation: 1837
I am new to django-dynamic-scraper, and I have been working through the sample project open_news to learn it. I have everything set up, but it keeps showing the same error: dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.
2015-11-20 18:45:11+0000 [article_spider] ERROR: Spider error processing <GET https://en.wikinews.org/wiki/Main_page>
Traceback (most recent call last):
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 645, in _tick
taskObj._oneWorkUnit()
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 491, in _oneWorkUnit
result = next(self._iterator)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
yield next(it)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
for x in result:
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/spiders/django_spider.py", line 378, in parse
rpt = self.scraper.get_rpt_for_scraped_obj_attr(url_elem.scraped_obj_attr)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/models.py", line 98, in get_rpt_for_scraped_obj_attr
return self.requestpagetype_set.get(scraped_obj_attr=soa)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/manager.py", line 127, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/query.py", line 334, in get
self.model._meta.object_name
dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.
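The last frames of the traceback show what goes wrong: the spider calls self.requestpagetype_set.get(scraped_obj_attr=soa), and Django's .get() raises DoesNotExist whenever no row matches the filter. A minimal pure-Python analogue of that lookup (not the DDS code itself, just an illustration of why a missing RequestPageType entry blows up here):

```python
class DoesNotExist(Exception):
    """Stand-in for Django's Model.DoesNotExist."""

class RequestPageType:
    def __init__(self, scraped_obj_attr, page_type):
        self.scraped_obj_attr = scraped_obj_attr
        self.page_type = page_type

def get_rpt_for_scraped_obj_attr(page_types, soa):
    # Mirrors requestpagetype_set.get(scraped_obj_attr=soa):
    # exactly one matching row is required, otherwise Django raises.
    matches = [pt for pt in page_types if pt.scraped_obj_attr == soa]
    if not matches:
        raise DoesNotExist("RequestPageType matching query does not exist.")
    return matches[0]

# Only one page type configured -> any other elem triggers the error.
page_types = [RequestPageType("base (Article)", "Main Page")]

print(get_rpt_for_scraped_obj_attr(page_types, "base (Article)").page_type)
try:
    get_rpt_for_scraped_obj_attr(page_types, "title (Article)")
except DoesNotExist as e:
    print(e)  # no RequestPageType row configured for this elem
```

So the fix is purely configuration: every scraper elem the spider touches needs its own RequestPageType entry, which is what the answers below set up.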
Upvotes: 1
Views: 693
Reputation: 11
I might be late to the party, but hopefully my solution will help those who come across this later.
@alan-nala's solution works well. However, it basically skips the detail page scraping.
Here is how you can take full advantage of the detail page scraping.
First, go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article) and add the required entries under Request page types.
Second, make sure your elements in SCRAPER ELEMS each reference one of those Request page types.
Now you can run the manual scraping command according to the docs:
scrapy crawl article_spider -a id=1 -a do_action=yes
You are likely to encounter the error mentioned by @alan-nala:
"ERROR: Mandatory elem title missing!"
Look closely at the log output: in my case there is a message indicating the script is "Calling DP2 URL for...".
Finally, go back to SCRAPER ELEMS and change the Request page type of the "title (Article)" element from "Detail Page 1" to "Detail Page 2".
Save your settings and run the scrapy command again.
Note: Your "Detail Page #" might vary.
By the way, I have also prepared a short tutorial on GitHub, in case you need more details on this topic.
Upvotes: 1
Reputation: 149
This is caused by missing "REQUEST PAGE TYPES" entries: each elem in "SCRAPER ELEMS" must have its own "REQUEST PAGE TYPE".
To solve this problem, please follow the steps below:
"REQUEST PAGE TYPES" Settings
All "Content type" are "HTML"
All "Request type" are "Request"
All "Method" are "Get"
For "Page type", just assign them in sequence like
(base (Article)) | Main Page
(title (Article)) | Detail Page 1
(description (Article) | Detail Page 2
(url (Article)) | Detail Page 3
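The requirement behind this step can be sketched as a simple check: every scraper elem needs a page type assigned, and any elem without one is exactly what triggers the DoesNotExist error above. A hypothetical helper (not part of DDS) to make that concrete:

```python
def missing_page_types(elems, page_types):
    """Return the scraper elems that lack a REQUEST PAGE TYPE entry."""
    return [e for e in elems if e not in page_types]

# Mapping mirroring the admin configuration above
page_types = {
    "base (Article)": "Main Page",
    "title (Article)": "Detail Page 1",
    "description (Article)": "Detail Page 2",
    "url (Article)": "Detail Page 3",
}

elems = list(page_types)  # the four SCRAPER ELEMS

print(missing_page_types(elems, page_types))                       # [] -> no error
# "extra (Article)" is an invented elem name for illustration:
print(missing_page_types(elems + ["extra (Article)"], page_types)) # this one would crash the spider
```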
After the steps above, the "DoesNotExist: RequestPageType" error should be fixed.
However, "ERROR: Mandatory elem title missing!" will come up next!
To solve this, I suggest changing the "REQUEST PAGE TYPE" of every entry in "SCRAPER ELEMS" to "Main Page", including "title (Article)".
Then change the XPaths as follows:
(base (Article)) | //td[@class="l_box"]
(title (Article)) | span[@class="l_title"]/a/@title
(description (Article)) | p/span[@class="l_summary"]/text()
(url (Article)) | span[@class="l_title"]/a/@href
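Under these settings the base XPath selects each article box on the main page, and the other three paths are evaluated relative to it. A rough standalone sketch of that pattern using Python's xml.etree.ElementTree (whose XPath support is far more limited than Scrapy's, but enough for these simple paths; the HTML fragment is invented to mimic the page structure, not copied from Wikinews):

```python
import xml.etree.ElementTree as ET

# Invented minimal fragment mimicking the open_news list page structure
html = """
<table>
  <tr>
    <td class="l_box">
      <span class="l_title"><a title="Example title" href="/wiki/Example">Example</a></span>
      <p><span class="l_summary">Example summary.</span></p>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(html)
for box in root.findall('.//td[@class="l_box"]'):            # base elem
    link = box.find('span[@class="l_title"]/a')
    title = link.get("title")                                # title elem (@title)
    url = link.get("href")                                   # url elem (@href)
    summary = box.find('p/span[@class="l_summary"]').text    # description elem (text())
    print(title, url, summary)
```

ElementTree cannot express the trailing /@attr and /text() steps directly, hence the .get() and .text accesses; Scrapy's selectors handle the XPaths above as written.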
Finally, run scrapy crawl article_spider -a id=1 -a do_action=yes at the command prompt.
You should now be able to crawl the "Article" items.
You may check it from Home › Open_News › Articles
Enjoy~
Upvotes: 3