I've written a spider whose sole purpose is to extract one number from http://www.funda.nl/koop/amsterdam/, namely the maximum page number shown in the pager at the bottom of the page (for example, 255).
I managed to do this using a LinkExtractor with a regular expression that the URLs of these pages match. The spider is shown below:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from Funda.items import MaxPageItem


class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    # Link to a page containing thumbnails of several houses,
    # such as http://www.funda.nl/koop/amsterdam/p10/
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        page_numbers = []
        for link in links:
            # Select only pages with a link depth of 3
            if link.url.count('/') == 6 and link.url.endswith('/'):
                # For example, get the number 10 out of the string
                # 'http://www.funda.nl/koop/amsterdam/p10/'
                page_number = int(link.url.split("/")[-2].strip('p'))
                page_numbers.append(page_number)
        max_page_number = max(page_numbers)
        print("The maximum page number is %s" % max_page_number)
        yield {'max_page_number': max_page_number}
If I run this with feed output by entering scrapy crawl Funda_max_pages -o funda_max_pages.json at the command line, the resulting JSON file looks like this:
[
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257}
]
I find it strange that the dict is output 7 times instead of just once. After all, the yield statement is outside the for loop. Can anyone explain this behavior?
As a workaround, I've written the output to a text file to be used instead of the JSON feed output:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess


class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    # Link to a page containing thumbnails of several houses,
    # such as http://www.funda.nl/koop/amsterdam/p10/
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0  # Initialize the maximum page number
        for link in links:
            # Select only pages with a link depth of 3
            if link.url.count('/') == 6 and link.url.endswith('/'):
                print("The link is %s" % link.url)
                # For example, get the number 10 out of the string
                # 'http://www.funda.nl/koop/amsterdam/p10/'
                page_number = int(link.url.split("/")[-2].strip('p'))
                if page_number > max_page_number:
                    # Keep the largest page number seen so far
                    max_page_number = page_number
        print("The maximum page number is %s" % max_page_number)
        # For example, "amsterdam" in 'http://www.funda.nl/koop/amsterdam/p10/'
        place_name = link.url.split("/")[-3]
        print("The place name is %s" % place_name)
        # File name with the place name as prefix
        filename = str(place_name) + "_max_pages.txt"
        # Text mode ('w'), because a str is written rather than bytes
        with open(filename, 'w') as f:
            f.write('max_page_number = %s' % max_page_number)
        yield {'max_page_number': max_page_number}


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(FundaMaxPagesSpider)
process.start()  # The script will block here until the crawling is finished
I've also adapted the spider to run it as a script. The script will generate a text file amsterdam_max_pages.txt with a single line, max_page_number = 257.
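The URL parsing in the callback (counting slashes, splitting, stripping 'p') could also be done with one anchored regular expression. The following is only a sketch using the standard library; the helper name max_page_by_place, the PAGE_PATH pattern, and the third sample URL are hypothetical and not part of the spider above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: matches only paths such as /koop/amsterdam/p10/
PAGE_PATH = re.compile(r'^/koop/(?P<place>[^/]+)/p(?P<page>\d+)/$')


def max_page_by_place(urls):
    """Return {place name: highest page number} for pager URLs, skipping all others."""
    result = {}
    for url in urls:
        match = PAGE_PATH.match(urlsplit(url).path)
        if match is None:
            continue  # Not a pager link at the expected depth
        place = match.group('place')
        page = int(match.group('page'))
        result[place] = max(page, result.get(place, 0))
    return result


urls = [
    "http://www.funda.nl/koop/amsterdam/p2/",
    "http://www.funda.nl/koop/amsterdam/p257/",
    "http://www.funda.nl/koop/amsterdam/p10/extra/",  # Deeper link, ignored
]
print(max_page_by_place(urls))  # {'amsterdam': 257}
```

Because the place name is captured together with the page number, it does not have to be read from whichever link happened to come last in the loop.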
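The rule makes the spider follow every link that the LinkExtractor matches, and it calls get_max_page_number on every one of those. get_max_page_number returns a dictionary, so one dictionary ends up in the output for each followed page.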
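To make this concrete, here is a toy simulation in plain Python, without Scrapy. It assumes that the start page contains seven matching pager links and that every followed page shows the same pager; the crawl function and the sample page numbers are hypothetical stand-ins, not Scrapy's actual machinery.

```python
def get_max_page_number(pager_links):
    """Stand-in for the spider callback: yields exactly one dict per call."""
    page_numbers = [int(url.split("/")[-2].strip("p")) for url in pager_links]
    yield {"max_page_number": max(page_numbers)}


def crawl(followed_links, pager_links):
    """Toy crawl: invoke the callback once for every followed link."""
    items = []
    for _url in followed_links:
        # Assumption: each followed page contains the same pager links
        items.extend(get_max_page_number(pager_links))
    return items


# Seven hypothetical pager links found on the start page
pager = ["http://www.funda.nl/koop/amsterdam/p%d/" % n
         for n in (2, 3, 4, 5, 6, 7, 257)]

items = crawl(pager, pager)
print(len(items))  # 7
print(items[0])    # {'max_page_number': 257}
```

The yield is indeed outside the for loop, but the whole callback runs once per followed page. One way to get a single item would be to run the extraction on the start page only, for instance in the parse method of a plain scrapy.Spider rather than through a Rule.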