I am continuing a Scrapy project from an earlier question, "scrapy output item as 1 list element per row". My Scrapy code returns data for UFC events in one parse method, and then returns totals and round-by-round data for each match of the event in an additional parse method (the matches live at separate links).
The scraped data in the resulting CSV file is correct; however, the formatting is problematic:
event_name event_date event_loc attendance wclass method mthdtl finround fintime winner loser bout fighters method_txt mthdtl_txt m_finround m_fintime timefrmt ref w_kd l_kd w_sigstr l_sigstr w_sigstr_perc l_sigstr_perc w_tot_str l_tot_str w_td l_td w_td_perc l_td_perc w_sub_att l_sub_att w_pass l_pass w_rev l_rev r1_w_kd r1_w_tot_str r1_w_td r1_w_td_perc r1_w_sub_att r1_w_pass r1_w_rev r1_l_kd r1_l_tot_str r1_l_td r1_l_td_perc r1_l_sub_att r1_l_pass r1_l_rev r1_w_sigstr r1_l_sigstr r1_w_sigstr_perc r1_w_sigstr_perc r1_w_sigstr_head r1_l_sigstr_head r1_w_sigstr_body r1_l_sigstr_body r1_w_sigstr_leg r1_l_sigstr_leg r1_w_sigstr_dist r1_l_sigstr_dist r1_w_sigstr_clinch r1_l_sigstr_clinch r1_w_sigstr_ground r1_l_sigstr_ground r2_w_kd r2_w_tot_str r2_w_td r2_w_td_perc r2_w_sub_att r2_w_pass r2_w_rev r2_l_kd r2_l_tot_str r2_l_td r2_l_td_perc r2_l_sub_att r2_l_pass r2_l_rev r2_w_sigstr r2_l_sigstr r2_w_sigstr_perc r2_w_sigstr_perc r2_w_sigstr_head r2_l_sigstr_head r2_w_sigstr_body r2_l_sigstr_body r2_w_sigstr_leg r2_l_sigstr_leg r2_w_sigstr_dist r2_l_sigstr_dist r2_w_sigstr_clinch r2_l_sigstr_clinch r2_w_sigstr_ground r2_l_sigstr_ground r3_w_kd r3_w_tot_str r3_w_td r3_w_td_perc r3_w_sub_att r3_w_pass r3_w_rev r3_l_kd r3_l_tot_str r3_l_td r3_l_td_perc r3_l_sub_att r3_l_pass r3_l_rev r3_w_sigstr r3_l_sigstr r3_w_sigstr_perc r3_w_sigstr_perc r3_w_sigstr_head r3_l_sigstr_head r3_w_sigstr_body r3_l_sigstr_body r3_w_sigstr_leg r3_l_sigstr_leg r3_w_sigstr_dist r3_l_sigstr_dist r3_w_sigstr_clinch r3_l_sigstr_clinch r3_w_sigstr_ground r3_l_sigstr_ground r4_w_kd r4_w_tot_str r4_w_td r4_w_td_perc r4_w_sub_att r4_w_pass r4_w_rev r4_l_kd r4_l_tot_str r4_l_td r4_l_td_perc r4_l_sub_att r4_l_pass r4_l_rev r4_w_sigstr r4_l_sigstr r4_w_sigstr_perc r4_w_sigstr_perc r4_w_sigstr_head r4_l_sigstr_head r4_w_sigstr_body r4_l_sigstr_body r4_w_sigstr_leg r4_l_sigstr_leg r4_w_sigstr_dist r4_l_sigstr_dist r4_w_sigstr_clinch r4_l_sigstr_clinch r4_w_sigstr_ground r4_l_sigstr_ground r5_w_kd r5_w_tot_str r5_w_td r5_w_td_perc r5_w_sub_att r5_w_pass r5_w_rev r5_l_kd r5_l_tot_str r5_l_td r5_l_td_perc r5_l_sub_att r5_l_pass r5_l_rev r5_w_sigstr r5_l_sigstr r5_w_sigstr_perc r5_w_sigstr_perc r5_w_sigstr_head r5_l_sigstr_head r5_w_sigstr_body r5_l_sigstr_body r5_w_sigstr_leg r5_l_sigstr_leg
UFC 241: Cormier vs. Miocic 2 August 17, 2019 Anaheim, California, USA 17,304 Heavyweight,, KO/TKO Punches 4 04:09 Stipe Miocic Daniel Cormier
UFC 241: Cormier vs. Miocic 2 August 17, 2019 Anaheim, California, USA 17,304 Welterweight, U-DEC 3 05:00 Nate Diaz Anthony Pettis
UFC 241: Cormier vs. Miocic 2 August 17, 2019 Anaheim, California, USA 17,304 Middleweight,, U-DEC 3 05:00 Paulo Costa Yoel Romero
Welterweight Bout Anthony Pettis,Nate Diaz Decision - Unanimous 3 05:00 3 Rnd (5-5-5) Mike Beltran,Guilherme Bravo,Derek Cleary,Ron McCarthy 0 1 69 of 133 114 of 201 51% 56% 86 of 153 205 of 306 0 of 0 1 of 1 0% 100% 1 0 0 4 2 1 0 23 of 41 0 of 0 0% 1 0 0 0 62 of 88 1 of 1 100% 0 2 0
14 of 31 22 of 42 45% 45% 9 of 22 15 of 33 2 of 2 5 of 6 3 of 7 2 of 3 9 of 24 9 of 23 5 of 7 6 of 9 0 of 0 7 of 10 0 40 of 70 0 of 0 0% 0 0 0 0 65 of 114 0 of 0 0% 0 0 0 36 of 66 54 of 100 54% 54% 28 of 55 45 of 87 7 of 9 7 of 11 1 of 2 2 of 2 26 of 54 29 of 63 10 of 12 25 of 37 0 of 0 0 of 0 0 23 of 42 0 of 0 0% 0 0 2 1 78 of 104 0 of 0 0% 0 2 1 19 of 36 38 of 59 52% 52% 17 of 34 34 of 52 1 of 1 4 of 6 1 of 1 0 of 1 11 of 24 13 of 23 5 of 8 12 of 17 3 of 4 13 of 19
Middleweight Bout Yoel Romero,Paulo Costa Decision - Unanimous 3 05:00 3 Rnd (5-5-5) Jason Herzog,Guilherme Bravo,Ron McCarthy,Michael Bell 1 1 125 of 284 118 of 213 44% 55% 125 of 284 118 of 213 1 of 4 0 of 0 25% 0% 0 0 0 0 0 0 1 32 of 69 0 of 2 0% 0 0 0 1 37 of 69 0 of 0 0% 0 0 0
32 of 69 37 of 69 46% 46% 23 of 54 19 of 46 2 of 7 16 of 20 7 of 8 2 of 3 31 of 68 32 of 61 1 of 1 2 of 2 0 of 0 3 of 6 0 40 of 91 1 of 1 100% 0 0 0 0 37 of 71 0 of 0 0% 0 0 0 40 of 91 37 of 71 43% 43% 28 of 77 24 of 53 6 of 7 12 of 17 6 of 7 1 of 1 39 of 90 36 of 70 1 of 1 1 of 1 0 of 0 0 of 0 0 53 of 124 0 of 1 0% 0 0 0 0 44 of 73 0 of 0 0% 0 0 0 53 of 124 44 of 73 42% 42% 45 of 113 24 of 49 3 of 6 18 of 21 5 of 5 2 of 3 48 of 118 42 of 71 5 of 6 2 of 2 0 of 0 0 of 0
UFC Heavyweight Title Bout Daniel Cormier,Stipe Miocic KO/TKO Punches to Head At Distance 4 04:09 5 Rnd (5-5-5-5-5) Herb Dean 0 1 181 of 263 123 of 229 68% 53% 230 of 317 135 of 244 1 of 3 1 of 3 33% 33% 0 0 2 0 0 0 0 71 of 83 1 of 2 50% 0 2 0 0 9 of 18 0 of 0 0% 0 0 0
37 of 46 7 of 13 80% 80% 25 of 34 3 of 8 7 of 7 0 of 0 5 of 5 4 of 5 13 of 16 6 of 12 3 of 3 0 of 0 21 of 27 1 of 1 0 59 of 85 0 of 0 0% 0 0 0 0 48 of 84 0 of 0 0% 0 0 0 56 of 82 46 of 82 68% 68% 56 of 81 37 of 72 0 of 0 8 of 9 0 of 1 1 of 1 45 of 68 42 of 76 11 of 14 4 of 6 0 of 0 0 of 0 0 69 of 100 0 of 1 0% 0 0 0 0 40 of 73 1 of 3 33% 0 0 0 57 of 86 34 of 67 66% 66% 53 of 82 28 of 61 1 of 1 5 of 5 3 of 3 1 of 1 50 of 76 24 of 50 7 of 10 10 of 17 0 of 0 0 of 0 0 31 of 49 0 of 0 0% 0 0 0 1 38 of 69 0 of 0 0% 0 0 0 31 of 49 36 of 67 63% 63% 28 of 46 18 of 47 1 of 1 14 of 16 2 of 2 4 of 4 31 of 49 30 of 57 0 of 0 5 of 5 0 of 0 1 of 5
First, the items from the first and second parse methods appear on separate rows: the second parse method's items form a separate block entirely to the right of and below the first parse method's items.
Second, within that block, the items skip a row to accommodate the round-by-round data yielded from an if-elif-else condition, which is slotted between the totals rows. I am using items and item loaders, but I am not currently using any custom item pipelines. I run the spider from the command line and output to CSV with:
scrapy crawl stats -o stats.csv
Abbreviated code:
class StatsSpider(scrapy.Spider):
    name = 'stats'
    allowed_domains = ['ufcstats.com']
    start_urls = ['http://ufcstats.com/statistics/events/completed?page=all']
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    }
    # ITEM_PIPELINES = {'stats.pipelines.StatsPipeline': 300,}
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': [ *extensive feed_export_fields* ],
    }
    def parse(self, response):
        rev_orderd_events = response.css('tr.b-statistics__table-row')
        # full event_links
        # event_links = rev_orderd_events.css('i>a::attr(href)').extract()
        # for url in event_links:
        #     yield scrapy.Request(url=event_links, callback=self.parse_event)
        event_links = rev_orderd_events.css('i>a::attr(href)')[3].extract()
        # for links in event_links:
        #     yield scrapy.Request(url=links, callback=self.parse_event)
        yield scrapy.Request(url=event_links, callback=self.parse_event, dont_filter=True)
    def parse_event(self, response):
        pg = response.css('div.l-page__container')
        for event in response.css('div.b-fight-details'):
            event_name = pg.css('h2.b-content__title>span::text').extract_first()
            event_date = event.css('ul.b-list__box-list>li:nth-child(1)::text').extract()
            event_loc = event.css('ul.b-list__box-list>li:nth-child(2)::text').extract()
            attendance = event.css('ul.b-list__box-list>li:nth-child(3)::text').extract()
            for fights in event.css('tr')[1:]:
                il = ItemLoader(StatsItem(), selector=fights)
                il.add_value('event_name', event_name)
                il.add_value('event_date', event_date)
                il.add_value('event_loc', event_loc)
                il.add_value('attendance', attendance)
                il.add_css('winner', 'td.b-fight-details__table-col:nth-child(2) p.b-fight-details__table-text:nth-child(odd)>a::text')
                il.add_css('loser', 'td.b-fight-details__table-col:nth-child(2) p.b-fight-details__table-text:nth-child(even)>a::text')
                il.add_css('wclass', 'td.b-fight-details__table-col:nth-child(7)>p:nth-child(1)::text')
                il.add_css('method', 'td.b-fight-details__table-col:nth-child(8)>p:nth-child(odd)::text')
                il.add_css('mthdtl', 'td.b-fight-details__table-col:nth-child(8)>p:nth-child(even)::text')
                il.add_css('finround', 'td.b-fight-details__table-col:nth-child(9)>p:nth-child(odd)::text')
                il.add_css('fintime', 'td.b-fight-details__table-col:nth-child(10)>p:nth-child(odd)::text')
                yield il.load_item()
        match_links = pg.css('tr>td:nth-child(1) a::attr(href)').extract()
        for links in match_links:
            yield scrapy.Request(url=links, callback=self.parse_match)
    def parse_match(self, response):
        section = response.css('section.b-statistics__section_details')
        f_dtl = section.css('div.b-fight-details')
        # m_event = section.css('h2>a::text').extract()
        m_info = f_dtl.css('div.b-fight-details__fight div i::text').extract()
        m_fin_dtl = f_dtl.css('div.b-fight-details__content>p::text').extract()
        ref = f_dtl.css('div.b-fight-details__content i>span::text').extract()
        # table_rows = f_dtl.css('tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()
        # timefrmt = f_dtl.css('div.b-fight-details__fight div i::text')[15].extract()
        fighters = f_dtl.css('table:nth-child(1) tr.b-fight-details__table-row>td.b-fight-details__table-col>p>a::text').extract()
        m_totals = f_dtl.css('table:nth-child(1) tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()
        rounds = f_dtl.css('table:nth-child(2) tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()
        for info in section:
            il = ItemLoader(StatsItem(), selector=section)
            il.add_value('bout', m_info)
            il.add_value('method_txt', m_info)
            il.add_value('mthdtl_txt', m_fin_dtl)
            il.add_value('m_finround', m_info)
            il.add_value('m_fintime', m_info)
            il.add_value('timefrmt', m_info)
            il.add_value('ref', ref)
            il.add_value('fighters', fighters)
            il.add_value('w_kd', m_totals)
            il.add_value('w_sigstr', m_totals)
            il.add_value('w_sigstr_perc', m_totals)
            il.add_value('w_tot_str', m_totals)
            il.add_value('w_td', m_totals)
            il.add_value('w_td_perc', m_totals)
            il.add_value('w_sub_att', m_totals)
            il.add_value('w_pass', m_totals)
            il.add_value('w_rev', m_totals)
            il.add_value('l_kd', m_totals)
            il.add_value('l_sigstr', m_totals)
            il.add_value('l_sigstr_perc', m_totals)
            il.add_value('l_tot_str', m_totals)
            il.add_value('l_td', m_totals)
            il.add_value('l_td_perc', m_totals)
            il.add_value('l_sub_att', m_totals)
            il.add_value('l_pass', m_totals)
            il.add_value('l_rev', m_totals)
            il.add_value('r1_w_kd', rounds)
            # il.add_value('r1_w_sigstr', rounds)
            # il.add_value('r1_w_sigstr_perc', rounds)
            il.add_value('r1_w_tot_str', rounds)
            il.add_value('r1_w_td', rounds)
            il.add_value('r1_w_td_perc', rounds)
            il.add_value('r1_w_sub_att', rounds)
            il.add_value('r1_w_pass', rounds)
            il.add_value('r1_w_rev', rounds)
            il.add_value('r1_l_kd', rounds)
            # il.add_value('r1_l_sigstr', rounds)
            # il.add_value('r1_l_sigstr_perc', rounds)
            il.add_value('r1_l_tot_str', rounds)
            il.add_value('r1_l_td', rounds)
            il.add_value('r1_l_td_perc', rounds)
            il.add_value('r1_l_sub_att', rounds)
            il.add_value('r1_l_pass', rounds)
            il.add_value('r1_l_rev', rounds)
            yield il.load_item()
        if len(rounds) == 42:
            r1 = ItemLoader(round_1_items(), selector=section)
            r1...
            yield r1.load_item()
        elif len(rounds) == 84:
            r2 = ItemLoader(round_2_items(), selector=section)
            r2...
            yield r2.load_item()
        elif len(rounds) == 126:
            r3 = ItemLoader(round_3_items(), selector=section)
            r3...
            yield r3.load_item()
        elif len(rounds) == 168:
            r4 = ItemLoader(round_4_items(), selector=section)
            r4...
            yield r4.load_item()
        elif len(rounds) == 210:
            r5 = ItemLoader(round_5_items(), selector=section)
            r5...
            yield r5.load_item()
        else:
            il = ItemLoader(StatsItem(), selector=section)
            il.add_value('rounders', rounds)
            yield il.load_item()
I would like each item to be output as one CSV row. So if the current CSV output is like:
1 (block of rows) 2a 2b (alternating total/round detail rows)
I want my csv to be:
1 - 2a - 2b...
Answer:
It took me a while to understand your question/problem, so apologies if my answer is not correct. Scrapy will write a new line to the output each time you yield an item, so you should only yield when you have a complete StatsItem. If it is essential that your data be parsed from two different pages, you can create your item in parse_event and then pass it through to the parse_match function partially filled, using either cb_kwargs (introduced in Scrapy 1.7) or the meta argument of Request.
So in parse_event you'd have

    yield scrapy.Request(..., callback=self.parse_match,
                         cb_kwargs={'item': il.load_item()})
and then you can modify parse_match to take item as an argument:

    def parse_match(self, response, item):
        ...
        # Later on
        il = ItemLoader(item, selector=section)
        # Fill rest of item
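Putting that together, here is a minimal sketch of the whole pattern, under some assumptions: I've kept your field names and selectors where possible, trimmed parse_event to a couple of representative fields, assumed StatsItem is importable from stats.items, and assumed each fight row in the event table contains its own detail link (in your code the links are collected at page level instead):

    import scrapy
    from scrapy.loader import ItemLoader
    from stats.items import StatsItem  # assumption: item class lives in stats/items.py

    class StatsSpider(scrapy.Spider):
        name = 'stats'

        def parse_event(self, response):
            pg = response.css('div.l-page__container')
            event_name = pg.css('h2.b-content__title>span::text').get()
            for fights in response.css('div.b-fight-details tr')[1:]:
                il = ItemLoader(StatsItem(), selector=fights)
                il.add_value('event_name', event_name)
                il.add_css('winner', 'td.b-fight-details__table-col:nth-child(2) '
                                     'p.b-fight-details__table-text:nth-child(odd)>a::text')
                # ... the rest of your add_value/add_css calls ...
                match_link = fights.css('td:nth-child(1) a::attr(href)').get()
                if match_link:
                    # no yield il.load_item() here: the partial item rides
                    # along with the request for the match-detail page
                    yield scrapy.Request(url=match_link,
                                         callback=self.parse_match,
                                         cb_kwargs={'item': il.load_item()})

        def parse_match(self, response, item):
            section = response.css('section.b-statistics__section_details')
            il = ItemLoader(item, selector=section)
            il.add_css('ref', 'div.b-fight-details__content i>span::text')
            # ... add the totals and the round-by-round fields to this same
            # loader, instead of yielding them as separate items ...
            yield il.load_item()  # the one complete row for this fight

With this shape, your len(rounds) if/elif branches would fill the r1_..., r2_... fields on the same loader rather than creating round_1_items and friends, so only one row per fight ever reaches the CSV.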
In conclusion, try to only do yield il.load_item() once per complete item.
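One last note: if you are on a Scrapy version older than 1.7 (so cb_kwargs is unavailable), the meta argument mentioned above works the same way. A brief sketch, reusing the hypothetical match_link from the sketch above:

    # in parse_event, the pre-1.7 alternative to cb_kwargs:
    yield scrapy.Request(url=match_link,
                         callback=self.parse_match,
                         meta={'item': il.load_item()})

    def parse_match(self, response):
        item = response.meta['item']  # recover the partially filled item
        il = ItemLoader(item, selector=response.css('section.b-statistics__section_details'))
        # fill the remaining fields, then yield the single complete item
        yield il.load_item()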