Reputation: 1837
I am new to django-dynamic-scraper, and I have been working through the sample project open_news to learn it. I have everything set up, but it keeps showing the same error: dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.
2015-11-20 18:45:11+0000 [article_spider] ERROR: Spider error processing <GET https://en.wikinews.org/wiki/Main_page>
Traceback (most recent call last):
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 645, in _tick
taskObj._oneWorkUnit()
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 491, in _oneWorkUnit
result = next(self._iterator)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
yield next(it)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
for x in result:
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/spiders/django_spider.py", line 378, in parse
rpt = self.scraper.get_rpt_for_scraped_obj_attr(url_elem.scraped_obj_attr)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/models.py", line 98, in get_rpt_for_scraped_obj_attr
return self.requestpagetype_set.get(scraped_obj_attr=soa)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/manager.py", line 127, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/query.py", line 334, in get
self.model._meta.object_name
dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.
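The last frames of the traceback show what goes wrong: the spider calls self.requestpagetype_set.get(scraped_obj_attr=soa), and Django's .get() raises DoesNotExist whenever no row matches the filter. A minimal pure-Python analogue of that lookup (not the DDS code itself, just an illustration of why a missing RequestPageType entry blows up here):

```python
class DoesNotExist(Exception):
    """Stand-in for Django's Model.DoesNotExist."""

class RequestPageType:
    def __init__(self, scraped_obj_attr, page_type):
        self.scraped_obj_attr = scraped_obj_attr
        self.page_type = page_type

def get_rpt_for_scraped_obj_attr(page_types, soa):
    # Mirrors requestpagetype_set.get(scraped_obj_attr=soa):
    # exactly one matching row is required, otherwise Django raises.
    matches = [pt for pt in page_types if pt.scraped_obj_attr == soa]
    if not matches:
        raise DoesNotExist("RequestPageType matching query does not exist.")
    return matches[0]

# Only one page type configured -> any other elem triggers the error.
page_types = [RequestPageType("base (Article)", "Main Page")]

print(get_rpt_for_scraped_obj_attr(page_types, "base (Article)").page_type)
try:
    get_rpt_for_scraped_obj_attr(page_types, "title (Article)")
except DoesNotExist as e:
    print(e)  # no RequestPageType row configured for this elem
```

So the fix is purely configuration: every scraper elem the spider touches needs its own RequestPageType entry, which is what the answers below set up.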
Upvotes: 1
Views: 693
Reputation: 11
I might be late to the party, but hopefully my solution will help those who come across this later.
@alan-nala's solution works well. However, it basically skips the detail page scraping.
Here is how you can take full advantage of the detail page scraping.
First, go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article) and add the required entries under Request page types.
Second, make sure your elements in SCRAPER ELEMS each reference one of those Request page types.
Now you can run the manual scraping command according to the docs:
scrapy crawl article_spider -a id=1 -a do_action=yes
You are likely to encounter the error mentioned by @alan-nala:
"ERROR: Mandatory elem title missing!"
Look closely at the log output: in my case there is a message indicating the script is "Calling DP2 URL for...".
Finally, go back to SCRAPER ELEMS and change the Request page type of the "title (Article)" element from "Detail Page 1" to "Detail Page 2".
Save your settings and run the scrapy command again.
Note: Your "Detail Page #" might vary.
By the way, I have also prepared a short tutorial on GitHub, in case you need more details on this topic.
Upvotes: 1
Reputation: 149
This is caused by missing "REQUEST PAGE TYPES" entries: each elem in "SCRAPER ELEMS" must have its own "REQUEST PAGE TYPE".
To solve this problem, please follow the steps below:
"REQUEST PAGE TYPES" Settings
All "Content type" are "HTML"
All "Request type" are "Request"
All "Method" are "Get"
For "Page type", just assign them in sequence like
(base (Article)) | Main Page
(title (Article)) | Detail Page 1
(description (Article) | Detail Page 2
(url (Article)) | Detail Page 3
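The requirement behind this step can be sketched as a simple check: every scraper elem needs a page type assigned, and any elem without one is exactly what triggers the DoesNotExist error above. A hypothetical helper (not part of DDS) to make that concrete:

```python
def missing_page_types(elems, page_types):
    """Return the scraper elems that lack a REQUEST PAGE TYPE entry."""
    return [e for e in elems if e not in page_types]

# Mapping mirroring the admin configuration above
page_types = {
    "base (Article)": "Main Page",
    "title (Article)": "Detail Page 1",
    "description (Article)": "Detail Page 2",
    "url (Article)": "Detail Page 3",
}

elems = list(page_types)  # the four SCRAPER ELEMS

print(missing_page_types(elems, page_types))                       # [] -> no error
# "extra (Article)" is an invented elem name for illustration:
print(missing_page_types(elems + ["extra (Article)"], page_types)) # this one would crash the spider
```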
After the steps above, the "DoesNotExist: RequestPageType" error should be fixed.
However, "ERROR: Mandatory elem title missing!" will come up next!
To solve this, I suggest changing the "REQUEST PAGE TYPE" of every entry in "SCRAPER ELEMS" to "Main Page", including "title (Article)".
Then change the XPaths as follows:
(base (Article)) | //td[@class="l_box"]
(title (Article)) | span[@class="l_title"]/a/@title
(description (Article)) | p/span[@class="l_summary"]/text()
(url (Article)) | span[@class="l_title"]/a/@href
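Under these settings the base XPath selects each article box on the main page, and the other three paths are evaluated relative to it. A rough standalone sketch of that pattern using Python's xml.etree.ElementTree (whose XPath support is far more limited than Scrapy's, but enough for these simple paths; the HTML fragment is invented to mimic the page structure, not copied from Wikinews):

```python
import xml.etree.ElementTree as ET

# Invented minimal fragment mimicking the open_news list page structure
html = """
<table>
  <tr>
    <td class="l_box">
      <span class="l_title"><a title="Example title" href="/wiki/Example">Example</a></span>
      <p><span class="l_summary">Example summary.</span></p>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(html)
for box in root.findall('.//td[@class="l_box"]'):            # base elem
    link = box.find('span[@class="l_title"]/a')
    title = link.get("title")                                # title elem (@title)
    url = link.get("href")                                   # url elem (@href)
    summary = box.find('p/span[@class="l_summary"]').text    # description elem (text())
    print(title, url, summary)
```

ElementTree cannot express the trailing /@attr and /text() steps directly, hence the .get() and .text accesses; Scrapy's selectors handle the XPaths above as written.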
Finally, run scrapy crawl article_spider -a id=1 -a do_action=yes at the command prompt.
You should now be able to crawl the "Article" items.
You may check it from Home › Open_News › Articles
Enjoy~
Upvotes: 3