Reputation: 1511
I want to scrape an API. The API returns some data along with the total amount of data available, and I want to use that total to request every remaining page.
But I am not sure how to do that in Scrapy. This is my start_requests:
def start_requests(self):
    url = "https://hkapi.centanet.com/api/Transaction/Map.json"
    page = 1
    headers = {
        'lang': 'tc',
        'Content-Type': 'application/json; charset=UTF-8',
        'Connection': 'Keep-Alive',
        'User-Agent': 'okhttp/4.7.2'
    }
    payload = {
        "daterange": 180,
        "postType": "s",
        "refdate": "20200701",
        "order": "desc",
        "page": f"{page}",
        "pageSize": 100,
        "pixelHeight": 2220,
        "pixelWidth": 1080,
        "points[0].lat": 22.695053063373795,
        "points[0].lng": 113.85844465345144,
        "points[1].lat": 22.695053063373795,
        "points[1].lng": 114.38281349837781,
        "points[2].lat": 21.993328259196705,
        "points[2].lng": 114.38281349837781,
        "points[3].lat": 21.993328259196705,
        "points[3].lng": 113.85844465345144,
        "sort": "score",
        "zoom": 9.745128631591797,
        "platform": "android"
    }
    yield scrapy.Request(url, callback=self.parse, method="POST", headers=headers, body=json.dumps(payload))
This is my parse function:
def parse(self, response):
    json_response = json.loads(response.text)
    yield json_response
I think I can extract the total number of records and calculate the total number of pages in the parse function. But how can I take that number and construct a list of payloads?
For example, if the total number of pages is 3, I would construct a list of three payloads and then loop through them.
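In other words, I imagine something like this (a rough sketch with plain dicts, where base_payload stands for a trimmed version of the payload above and the count is the total reported by the API):

```python
import math

# Sketch: base_payload stands for the payload above (only the paging
# fields are shown here).
base_payload = {"page": "1", "pageSize": 100}

count = 34037  # total number of records reported by the API
total_pages = math.ceil(count / base_payload["pageSize"])  # 341

# One payload per page, differing only in the "page" field.
payloads = [{**base_payload, "page": str(p)} for p in range(1, total_pages + 1)]
print(len(payloads))  # 341
```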
Example JSON response:
{
    "DITems": [],
    "TransactionCount": 34037,
    "Count": 34037,
    "MinPoint": {
        "Lat": 22.2390387561,
        "Lng": 113.9203349215
    },
    "MaxPoint": {
        "Lat": 22.5454478015,
        "Lng": 114.2243478859
    },
    "RoundTripNeeded": false
}
Thanks! This is my first project using Scrapy!
Upvotes: 2
Views: 156
Reputation: 2335
If I've understood you correctly, once the first request has given you the total number of pages, all you have to do is loop over the remaining page numbers and send one request per payload.
I'm using total_pages as the page count inside the parse function; your example response exposes the total record count as "Count", so with a pageSize of 100 you can compute it as math.ceil(count / 100).
url = "https://hkapi.centanet.com/api/Transaction/Map.json"

headers = {
    'lang': 'tc',
    'Content-Type': 'application/json; charset=UTF-8',
    'Connection': 'Keep-Alive',
    'User-Agent': 'okhttp/4.7.2'
}

first_payload = {
    "daterange": 180,
    "postType": "s",
    "refdate": "20200701",
    "order": "desc",
    "page": "1",
    "pageSize": 100,
    "pixelHeight": 2220,
    "pixelWidth": 1080,
    "points[0].lat": 22.695053063373795,
    "points[0].lng": 113.85844465345144,
    "points[1].lat": 22.695053063373795,
    "points[1].lng": 114.38281349837781,
    "points[2].lat": 21.993328259196705,
    "points[2].lng": 114.38281349837781,
    "points[3].lat": 21.993328259196705,
    "points[3].lng": 113.85844465345144,
    "sort": "score",
    "zoom": 9.745128631591797,
    "platform": "android"
}
def start_requests(self):
    yield scrapy.Request(url=self.url, callback=self.parse, method="POST", headers=self.headers, body=json.dumps(self.first_payload))

def parse(self, response):
    data = json.loads(response.text)
    # Your example response exposes the total as "Count"; with a pageSize
    # of 100, the page count is ceil(Count / 100). Requires `import math`.
    total_pages = math.ceil(data['Count'] / 100)
    for i in range(2, total_pages + 1):
        page = i
        payload = {
            "daterange": 180,
            "postType": "s",
            "refdate": "20200701",
            "order": "desc",
            "page": f"{page}",
            "pageSize": 100,
            "pixelHeight": 2220,
            "pixelWidth": 1080,
            "points[0].lat": 22.695053063373795,
            "points[0].lng": 113.85844465345144,
            "points[1].lat": 22.695053063373795,
            "points[1].lng": 114.38281349837781,
            "points[2].lat": 21.993328259196705,
            "points[2].lng": 114.38281349837781,
            "points[3].lat": 21.993328259196705,
            "points[3].lng": 113.85844465345144,
            "sort": "score",
            "zoom": 9.745128631591797,
            "platform": "android"
        }
        yield scrapy.Request(url=self.url, callback=self.parse_new_requests, method="POST", headers=self.headers, body=json.dumps(payload))

def parse_new_requests(self, response):
    json_response = json.loads(response.text)
    yield json_response
We make the first request just to grab the total page count, and compute total_pages inside the parse function. We can then loop over range(2, total_pages + 1), since the first page has already been fetched. Each payload is built inside the loop and handed to parse_new_requests.
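Since only the "page" field differs between payloads, you can also avoid repeating the whole dict. Here is a sketch with plain dicts, where first_payload stands for a trimmed version of the dict defined above:

```python
import json

# Sketch: a trimmed stand-in for the first_payload dict defined above.
first_payload = {"page": "1", "pageSize": 100, "sort": "score"}

def build_payload(page):
    # Copy the base payload and override only the "page" field.
    return {**first_payload, "page": str(page)}

# Request bodies for pages 2 and 3, ready to pass as `body=`.
bodies = [json.dumps(build_payload(p)) for p in range(2, 4)]
```

This keeps the spider short and means a change to the query (say, a different daterange) only has to be made in one place.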
Upvotes: 3