Reputation: 23
I believe there is a better way to get response using scrapy.Request then I do
...
import urllib.request
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
...
class MatchResultsSpider(scrapy.Spider):
name = 'match_results'
allowed_domains = ['site.com']
start_urls = ['url.com']
def get_detail_page_data(self, detail_url):
req = urllib.request.Request(
detail_url,
data=None,
headers={
'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
'accept': 'application/json, text/javascript, */*; q=0.01',
'referer': 'site.com',
}
)
page = urllib.request.urlopen(req)
response = HtmlResponse(url=detail_url, body=page.read())
target = Selector(response=response)
return target.xpath('//dd[@data-first_name]/text()').extract_first()
I get all information inside parse function. But in one place I need to get a little peace data from inside detail page.
# Lineups
lineup_team_tables = lineups_container.xpath('.//tbody')
for i, table in enumerate(lineup_team_tables):
# lineup players
line_up = []
lineup_players = table.xpath('./tr[not(contains(string(), "Coach"))]')
for lineup_player in lineup_players:
line_up_entries = {}
lineup_player_url = lineup_player.xpath('.//a/@href').extract_first()
line_up_entries['player_id'] = get_id(lineup_player_url)
line_up_entries['jersey_num'] = lineup_player.xpath('./td[@class="shirtnumber"]/text()').extract_first()
abs_lineup_player_url = response.urljoin(lineup_player_url)
line_up_entries['position_id_detail'] = self.get_detail_page_data(abs_lineup_player_url)
line_up.append(line_up_entries)
# team_lineup['line_up'] = line_up
self.write_to_scuard(i, 'line_up', line_up)
Can I get data from other page using scrapy.Request(detail_url, calback_func)?
Thank for your help!
Upvotes: 1
Views: 243
Reputation: 142
Too much extra code. Use simple scheme of Scrapy parsing:
class ********(scrapy.Spider):
name = '*******'
domain = '****'
allowed_domains = ['****']
start_urls = ['https://******']
custom_settings = {
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64;AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
'DEFAULT_REQUEST_HEADERS': {
'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'ACCEPT_ENCODING': 'gzip, deflate, br',
'ACCEPT_LANGUAGE': 'en-US,en;q=0.9',
'CONNECTION': 'keep-alive',
}
def parse(self, response):
(You already have responsed html start_urls = ['https://******'])
yield scrapy.Request(url, callback=self.parse_details)
then you can parse further (nested). And return back to parse callback:
def parse_details(self, response):
************
yield scrapy.Request(url_2, callback=self.parse)
Upvotes: 1