Reputation: 10580
The website that I am crawling contains many players, and when I click on any player, I am taken to his page.
The website structure is like this:
<main page>
<link to player 1>
<link to player 2>
<link to player 3>
...
<link to player n>
</main page>
And when I click on any link, I go to the player's page, which is like this:
<player name>
<player team>
<player age>
<player salary>
<player date>
I want to scrape all the players whose age is between 20 and 25 years.
My plan is:
1. Scrape the main page using the first spider.
2. Get the links using the first spider.
3. Crawl each link using the second spider.
4. Get the player information using the second spider.
5. Save this information in a JSON file using a pipeline (a sketch follows this list).
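For step 5, a minimal sketch of such a pipeline (the class name and the players.json file name are placeholders I made up):

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('players.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write one JSON object per line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

(Alternatively, Scrapy's built-in feed exports can write JSON without a custom pipeline: scrapy crawl myspider -o players.json.)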
How can I return the date value from the second spider to the first spider?
I built my own middleware and overrode process_spider_output. It allows me to print the request, but I don't know what else I should do in order to return that date value to my first spider.
Any help is appreciated.
Here is some of the code:
def parse(self, response):
    sel = Selector(response)
    container = sel.css('div[MyDiv]')
    for player in container:
        # extract LINK and TITLE
        yield Request(LINK, meta={'Title': TITLE}, callback=self.parsePlayer)

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE
    return player
Upvotes: 5
Views: 4179
Reputation: 9644
All you need to do is check the date in parsePlayer, and return only the relevant items.
def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE
    if DATE == some_criteria:
        yield player
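For instance, a concrete version of that check, assuming the player page shows a birth date in YYYY-MM-DD format (the .bday selector and the age field are invented for illustration):

from datetime import date
from scrapy.selector import Selector

def parsePlayer(self, response):
    sel = Selector(response)
    player = PlayerItem()
    # '.bday::text' is a hypothetical selector; adapt it to the real page
    y, m, d = map(int, sel.css('.bday::text').extract()[0].split('-'))
    age = (date.today() - date(y, m, d)).days // 365  # rough age in years
    if 20 <= age <= 25:
        player['age'] = age
        yield player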
You might, however, want to stop crawling entirely once you hit an out-of-range player, for example if you have performance issues (you are scraping way too many links and you don't need the ones after some limit). Given that Scrapy works with asynchronous requests, there is no really good way to do that. The only option you have is to force linear behavior instead of the default parallel requests.
Let me explain. With two callbacks like that, by default Scrapy will first parse the first page (the main page) and put all the requests for the player pages in its queue. Without waiting for that first page to finish being scraped, it will start treating those player-page requests (not necessarily in the order it found them).
Therefore, by the time you learn that player page p fails the date criteria, Scrapy has already sent internal requests for p+1, p+2, ..., p+m (m is basically a random number) AND has probably started treating some of those requests, possibly even p+1 before p (no fixed order, remember).
So there is no way to stop exactly at the right page if you keep this pattern, and no way to interact with parse from parsePlayer.
What you can do is force it to follow the links in order, so that you have full control. The drawback is a big toll on performance: if Scrapy follows each link one after the other, it can't treat them simultaneously as it usually does, and things slow down.
The code could be something like:
def parse(self, response):
    sel = Selector(response)
    self.container = sel.css('div[MyDiv]')
    return self.increment(0)

# Generator that yields the request for player number `index`
def increment(self, index):
    player = self.container[index]  # select current player
    # extract LINK and TITLE
    yield Request(LINK, meta={'Title': TITLE, 'index': index}, callback=self.parsePlayer)

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE
    yield player
    if DATE == some_criteria:
        index = response.meta['index'] + 1
        for request in self.increment(index):
            yield request
That way Scrapy will get the main page, then the first player, then the second player, and so on, strictly one at a time, until it finds a date that doesn't fit the criteria. At that point parsePlayer yields no follow-up request and the spider stops.
This gets a little more complex if you also have to increment across main pages (if there are n main pages, for example), but the idea stays the same; a possible shape is sketched below.
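A sketch of that multi-page variant, under stated assumptions: the div.player and a.next selectors, and the way the player link and title are extracted, are all invented for illustration, not taken from the actual site:

from scrapy.http import Request
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    self.container = sel.css('div.player')  # hypothetical player container
    # hypothetical selector for the link to the next main page
    next_page = sel.css('a.next::attr(href)').extract()
    self.next_page_url = next_page[0] if next_page else None
    return self.increment(0)

def increment(self, index):
    if index < len(self.container):
        # same idea as above: request player number `index`
        player = self.container[index]
        link = player.css('a::attr(href)').extract()[0]
        title = player.css('a::text').extract()[0]
        yield Request(link, meta={'Title': title, 'index': index},
                      callback=self.parsePlayer)
    elif self.next_page_url:
        # ran past the last player of this page: move on to the next main page
        yield Request(self.next_page_url, callback=self.parse)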
Upvotes: 4
Reputation: 10580
First of all, I want to thank @warwaruk and @Robin for helping me with this issue, and the biggest thanks to my great teacher @pault.
I found the solution and here is the algorithm:
In the callback for each player:
1. Extract the player's information.
2. Check if the date is in the range. If not: do nothing. If yes: check whether this is the last player in the main page's player list; if it is, also issue a callback request to the next main page.
def parse(self, response):
    currentPlayer = 0
    for player in Players:  # Players: the list of players extracted from the main page
        currentPlayer += 1
        yield Request(player.link,
                      meta={'currentPlayer': currentPlayer, 'numberOfPlayers': len(Players)},
                      callback=self.parsePlayer)

def parsePlayer(self, response):
    currentPlayer = response.meta['currentPlayer']
    numberOfPlayers = response.meta['numberOfPlayers']
    # extract player's information
    if player[date] in range:
        if currentPlayer == numberOfPlayers:
            # last in-range player on this page: also request the next main page
            yield Request(linkToNextMainPage, callback=self.parse)
        yield playerInformation  # in order to be written in the JSON file
It works perfectly :)
Upvotes: 2
Reputation: 59674
Something like (based on Robin's answer):
class PlayerSpider(Spider):

    def __init__(self):
        self.player_urls = []
        self.done = False  # flag to know when a player with bday out of range is found

    def extract_player_urls(self, response):
        sel = Selector(response)
        self.player_urls.extend(...)  # extracted player links

    def parse(self, response):
        self.extract_player_urls(response)
        for i in xrange(10):
            yield Request(self.player_urls.pop(), callback=self.parse_player)

    def parse_player(self, response):
        if self.done:
            return
        # ... extract player birth date
        if bd_date not in range:
            self.done = True
            # ... somehow clear downloader queue
            return
        # ... create and fill item
        yield item
        yield Request(self.player_urls.pop(), callback=self.parse_player)
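One concrete way to fill in the "somehow clear downloader queue" placeholder, assuming it is acceptable to stop the whole crawl at that point, is Scrapy's CloseSpider exception (a sketch; extract_birth_date and the date bounds are hypothetical):

from scrapy.exceptions import CloseSpider

def parse_player(self, response):
    bd_date = extract_birth_date(response)  # hypothetical helper
    if not (MIN_DATE <= bd_date <= MAX_DATE):  # assumed range bounds
        # asks the engine to stop scheduling new requests and close the spider
        raise CloseSpider('birth date out of range')
    # ... create and fill the item, yield it, then request the next URL as above

Note that responses already in flight will still reach the callback, which is why the self.done flag above remains useful.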
Upvotes: 2