Reputation: 379
I have written some Python code with Scrapy to extract some addresses from a website.
The first part of the code puts together the start_urls
by reading latitude and longitude coordinates from a separate file, googlecoords.txt, which then form part of the start_urls.
(The googlecoords.txt file, which I prepared previously, converts UK postcodes into Google coordinates for Google Maps.)
So, for example, the first item in the start_urls
list is "https://www.howdens.com/process/searchLocationsNear.php?lat=53.674434&lon=-1.4908923&distance=1000&units=MILES", where "lat=53.674434&lon=-1.4908923" comes from the googlecoords.txt file.
However, when I run the code it works perfectly, except that it prints out the googlecoords.txt file first, which I don't need.
How do I stop this print from happening? (Though I can live with it.)
import scrapy
import sys
from scrapy.http import FormRequest, Request
from Howdens.items import HowdensItem

class howdensSpider(scrapy.Spider):
    name = "howdens"
    allowed_domains = ["www.howdens.com"]

    # read the file that has a list of Google coordinates converted from postcodes
    with open("googlecoords.txt") as f:
        googlecoords = [x.strip('\n') for x in f.readlines()]

    # from the Google coordinates build the start URLs
    start_urls = []
    for a in range(len(googlecoords)):
        start_urls.append("https://www.howdens.com/process/searchLocationsNear.php?{}&distance=1000&units=MILES".format(googlecoords[a]))

    # cycle through the first 6 relevant items returned in the text
    def parse(self, response):
        for sel in response.xpath('/html/body'):
            for i in range(0, 6):
                try:
                    item = HowdensItem()
                    item['name'] = sel.xpath('.//text()').re(r'(?<="name":")(.*?)(?=","street")')[i]
                    item['street'] = sel.xpath('.//text()').re(r'(?<="street":")(.*?)(?=","town")')[i]
                    item['town'] = sel.xpath('.//text()').re(r'(?<="town":")(.*?)(?=","pc")')[i]
                    item['pc'] = sel.xpath('.//text()').re(r'(?<="pc":")(.*?)(?=","state")')[i]
                    yield item
                except IndexError:
                    pass
Upvotes: 0
Views: 61
Reputation: 21436
As someone pointed out in the comments, you should load the response with the json
module instead of picking it apart with regular expressions; you can also build your
requests in a start_requests() method rather than at class level:
import scrapy
import json

class MySpider(scrapy.Spider):
    name = "myspider"  # a spider needs a name to be run with "scrapy crawl"
    start_urls = ['https://www.howdens.com/process/searchLocationsNear.php?lat=53.674434&lon=-1.4908923&distance=1000&units=MILES']

    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        items = data['response']['depots']
        for item in items:
            url_template = "https://www.howdens.com/process/searchLocationsNear.php?{}&distance=1000&units=MILES"
            url = url_template.format(item['lat'])  # format in your location here
            yield scrapy.Request(url, self.parse_item)

    def parse_item(self, response):
        print(response.url)
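For completeness, here is a minimal sketch of the start_requests() idea mentioned above, assuming (as in the question) that googlecoords.txt contains one "lat=...&lon=..." fragment per line; the spider name and the choice to yield each depot dict as an item are placeholders, not part of the original answer:

import json
import scrapy

class HowdensCoordsSpider(scrapy.Spider):
    name = "howdens_coords"  # hypothetical name, adjust to your project

    def start_requests(self):
        url_template = ("https://www.howdens.com/process/searchLocationsNear.php?"
                        "{}&distance=1000&units=MILES")
        # read the postcode-to-coordinate conversions prepared earlier;
        # each line is expected to look like "lat=53.674434&lon=-1.4908923"
        with open("googlecoords.txt") as f:
            for line in f:
                coords = line.strip()
                if coords:
                    yield scrapy.Request(url_template.format(coords), callback=self.parse)

    def parse(self, response):
        # the endpoint returns JSON, so use the json module instead of regexes
        data = json.loads(response.text)
        for depot in data['response']['depots']:
            yield depot

Because the file is only opened inside start_requests(), nothing is read at class-definition time; the crawl behaves the same, it just builds its requests lazily when it starts.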
Upvotes: 1