Mahdi

Reputation: 807

Running a Scrapy Crawler

I am very new to Python and Scrapy, and I have written a crawler in PyCharm as follows:

import scrapy
from scrapy.spiders import Spider
from scrapy.http    import Request
import re

class TutsplusItem(scrapy.Item):
  title = scrapy.Field()



class MySpider(Spider):
  name = "tutsplus"
  allowed_domains   = ["bbc.com"]
  start_urls = ["http://www.bbc.com/"]

  def parse(self, response):
    links = response.xpath('//a/@href').extract()

# We stored already crawled links in this list
crawledLinks = []

for link in links:
  # If it is a proper link and is not checked yet, yield it to the Spider
  #if linkPattern.match(link) and not link in crawledLinks:
  if not link in crawledLinks:
    link = "http://www.bbc.com" + link
    crawledLinks.append(link)
    yield Request(link, self.parse)

titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
for title in titles:
  item = TutsplusItem()
  item["title"] = title
  print("Title is : %s" %title)
  yield item

However, when I run the above code, nothing prints on the screen! What is wrong with my code?

Upvotes: 1

Views: 921

Answers (3)

Hosni

Reputation: 668

To run a spider from within PyCharm you need to configure the Run/Debug configuration properly. Running your_spider.py as a standalone script will not do anything.

As mentioned by @stranac, scrapy crawl is the way to go, with scrapy being the binary and crawl an argument passed to it.

Configure Run/Debug

In the main menu go to: Run > Edit Configurations...

  • Find the appropriate scrapy binary within your virtualenv and set its absolute path as Script. It should look something like this: /home/username/.virtualenvs/your_virtualenv_name/bin/scrapy
  • In the Parameters field, set the arguments the scrapy binary will execute. In your case, you want to start your spider, so it should look like this:

crawl your_spider_name e.g. crawl tutsplus

  • Make sure that the Python interpreter is the one where you set up Scrapy and the other packages needed for your project.

  • Make sure that the working directory is the directory containing settings.py, which is also generated by Scrapy.

From now on you should be able to Run and Debug your spiders from within PyCharm.
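Alternatively, a small runner script can be used as the script target instead of the scrapy binary itself. A minimal sketch, assuming it lives next to scrapy.cfg in the project directory and the spider is named tutsplus:

from scrapy import cmdline

# Equivalent to running "scrapy crawl tutsplus" in a terminal;
# PyCharm can run or debug this file directly.
cmdline.execute("scrapy crawl tutsplus".split())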

Upvotes: 0

宏杰李

Reputation: 12158

Put the code in a text file, name it something like your_spider.py, and run the spider using the runspider command:

scrapy runspider your_spider.py
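This works on a standalone spider file without a full Scrapy project. If you also want to keep the items the spider yields, the standard feed option can be added, for example (assuming JSON output is fine):

scrapy runspider your_spider.py -o titles.json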

Upvotes: 1

stranac

Reputation: 28206

You would typically start scrapy using scrapy crawl, which will hook everything up for you and start crawling.

It also looks like your code is not properly indented (only the first line is inside parse, when all of them should be).
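For reference, here is a sketch of the same parse method with its whole body indented inside the method (same imports, classes, and style as in the question):

  def parse(self, response):
    links = response.xpath('//a/@href').extract()

    # Links we have already scheduled (note: this list is recreated on every call to parse)
    crawledLinks = []

    for link in links:
      # If the link has not been seen yet, schedule it
      if link not in crawledLinks:
        link = "http://www.bbc.com" + link
        crawledLinks.append(link)
        yield Request(link, self.parse)

    titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
    for title in titles:
      item = TutsplusItem()
      item["title"] = title
      print("Title is : %s" % title)
      yield item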

Upvotes: 0
