Laurențiu Dascălu

Reputation: 2069

How to use Scrapy

I would like to know how I can start a crawler based on Scrapy. I installed the tool via apt-get install and tried to run an example:

/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy list
directory.google.com

/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy crawl

I hacked on the code in spiders/google_directory.py, but it doesn't seem to be executed, because I don't see any of the print statements I inserted. I read their documentation, but I found nothing related to this; do you have any ideas?

Also, if you think I should use a different tool for crawling a website, please let me know. I'm not experienced with Python tools, but Python is a must.

Thanks!

Upvotes: 3

Views: 5750

Answers (2)

fmalina

Reputation: 6310

EveryBlock.com released some quality scraping code using lxml, urllib2 and Django as their stack.

Scraperwiki.com is inspirational, full of examples of python scrapers.

A simple example with cssselect:

from lxml.html import fromstring

dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]

Upvotes: 7

Pablo Hoffman

Reputation: 1540

You missed the spider name in the crawl command. Use:

$ scrapy crawl directory.google.com

Also, I suggest you copy the example project to your home, instead of working in the /usr/share/doc/scrapy/examples/ directory, so you can modify it and play with it:

$ cp -r /usr/share/doc/scrapy/examples/googledir ~
$ cd ~/googledir
$ scrapy crawl directory.google.com

Upvotes: 7
