user11585173

Scrapy output items - multiple parse methods, one row per item

I am continuing a Scrapy project from an earlier question ("scrapy output item as 1 list element per row"). My Scrapy code returns data from UFC events in one parse method, and then returns totals and round-by-round data for each match of the event in an additional parse method (separate links).

The scraped data in the resulting CSV file is correct. However, the formatting is problematic:

event_name  event_date  event_loc   attendance  wclass  method  mthdtl  finround    fintime winner  loser   bout    fighters    method_txt  mthdtl_txt  m_finround  m_fintime   timefrmt    ref w_kd    l_kd    w_sigstr    l_sigstr    w_sigstr_perc   l_sigstr_perc   w_tot_str   l_tot_str   w_td    l_td    w_td_perc   l_td_perc   w_sub_att   l_sub_att   w_pass  l_pass  w_rev   l_rev   r1_w_kd r1_w_tot_str    r1_w_td r1_w_td_perc    r1_w_sub_att    r1_w_pass   r1_w_rev    r1_l_kd r1_l_tot_str    r1_l_td r1_l_td_perc    r1_l_sub_att    r1_l_pass   r1_l_rev    r1_w_sigstr r1_l_sigstr r1_w_sigstr_perc    r1_w_sigstr_perc    r1_w_sigstr_head    r1_l_sigstr_head    r1_w_sigstr_body    r1_l_sigstr_body    r1_w_sigstr_leg r1_l_sigstr_leg r1_w_sigstr_dist    r1_l_sigstr_dist    r1_w_sigstr_clinch  r1_l_sigstr_clinch  r1_w_sigstr_ground  r1_l_sigstr_ground  r2_w_kd r2_w_tot_str    r2_w_td r2_w_td_perc    r2_w_sub_att    r2_w_pass   r2_w_rev    r2_l_kd r2_l_tot_str    r2_l_td r2_l_td_perc    r2_l_sub_att    r2_l_pass   r2_l_rev    r2_w_sigstr r2_l_sigstr r2_w_sigstr_perc    r2_w_sigstr_perc    r2_w_sigstr_head    r2_l_sigstr_head    r2_w_sigstr_body    r2_l_sigstr_body    r2_w_sigstr_leg r2_l_sigstr_leg r2_w_sigstr_dist    r2_l_sigstr_dist    r2_w_sigstr_clinch  r2_l_sigstr_clinch  r2_w_sigstr_ground  r2_l_sigstr_ground  r3_w_kd r3_w_tot_str    r3_w_td r3_w_td_perc    r3_w_sub_att    r3_w_pass   r3_w_rev    r3_l_kd r3_l_tot_str    r3_l_td r3_l_td_perc    r3_l_sub_att    r3_l_pass   r3_l_rev    r3_w_sigstr r3_l_sigstr r3_w_sigstr_perc    r3_w_sigstr_perc    r3_w_sigstr_head    r3_l_sigstr_head    r3_w_sigstr_body    r3_l_sigstr_body    r3_w_sigstr_leg r3_l_sigstr_leg r3_w_sigstr_dist    r3_l_sigstr_dist    r3_w_sigstr_clinch  r3_l_sigstr_clinch  r3_w_sigstr_ground  r3_l_sigstr_ground  r4_w_kd r4_w_tot_str    r4_w_td r4_w_td_perc    r4_w_sub_att    r4_w_pass   r4_w_rev    r4_l_kd r4_l_tot_str    r4_l_td r4_l_td_perc    r4_l_sub_att    r4_l_pass   r4_l_rev    r4_w_sigstr r4_l_sigstr 
r4_w_sigstr_perc    r4_w_sigstr_perc    r4_w_sigstr_head    r4_l_sigstr_head    r4_w_sigstr_body    r4_l_sigstr_body    r4_w_sigstr_leg r4_l_sigstr_leg r4_w_sigstr_dist    r4_l_sigstr_dist    r4_w_sigstr_clinch  r4_l_sigstr_clinch  r4_w_sigstr_ground  r4_l_sigstr_ground  r5_w_kd r5_w_tot_str    r5_w_td r5_w_td_perc    r5_w_sub_att    r5_w_pass   r5_w_rev    r5_l_kd r5_l_tot_str    r5_l_td r5_l_td_perc    r5_l_sub_att    r5_l_pass   r5_l_rev    r5_w_sigstr r5_l_sigstr r5_w_sigstr_perc    r5_w_sigstr_perc    r5_w_sigstr_head    r5_l_sigstr_head    r5_w_sigstr_body    r5_l_sigstr_body    r5_w_sigstr_leg r5_l_sigstr_leg
UFC 241: Cormier vs. Miocic 2   August 17, 2019 Anaheim, California, USA    17,304  Heavyweight,,   KO/TKO  Punches 4   04:09   Stipe Miocic    Daniel Cormier                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
UFC 241: Cormier vs. Miocic 2   August 17, 2019 Anaheim, California, USA    17,304  Welterweight,   U-DEC       3   05:00   Nate Diaz   Anthony Pettis                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
UFC 241: Cormier vs. Miocic 2   August 17, 2019 Anaheim, California, USA    17,304  Middleweight,,  U-DEC       3   05:00   Paulo Costa Yoel Romero                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                            Welterweight Bout   Anthony Pettis,Nate Diaz    Decision - Unanimous        3   05:00   3 Rnd (5-5-5)   Mike Beltran,Guilherme Bravo,Derek Cleary,Ron McCarthy  0   1   69 of 133   114 of 201  51% 56% 86 of 153   205 of 306  0 of 0  1 of 1  0%  100%    1   0   0   4   2   1   0   23 of 41    0 of 0  0%  1   0   0   0   62 of 88    1 of 1  100%    0   2   0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                            14 of 31    22 of 42    45% 45% 9 of 22 15 of 33    2 of 2  5 of 6  3 of 7  2 of 3  9 of 24 9 of 23 5 of 7  6 of 9  0 of 0  7 of 10 0   40 of 70    0 of 0  0%  0   0   0   0   65 of 114   0 of 0  0%  0   0   0   36 of 66    54 of 100   54% 54% 28 of 55    45 of 87    7 of 9  7 of 11 1 of 2  2 of 2  26 of 54    29 of 63    10 of 12    25 of 37    0 of 0  0 of 0  0   23 of 42    0 of 0  0%  0   0   2   1   78 of 104   0 of 0  0%  0   2   1   19 of 36    38 of 59    52% 52% 17 of 34    34 of 52    1 of 1  4 of 6  1 of 1  0 of 1  11 of 24    13 of 23    5 of 8  12 of 17    3 of 4  13 of 19                                                                                                                                                                                                                        
                                            Middleweight Bout   Yoel Romero,Paulo Costa Decision - Unanimous        3   05:00   3 Rnd (5-5-5)   Jason Herzog,Guilherme Bravo,Ron McCarthy,Michael Bell  1   1   125 of 284  118 of 213  44% 55% 125 of 284  118 of 213  1 of 4  0 of 0  25% 0%  0   0   0   0   0   0   1   32 of 69    0 of 2  0%  0   0   0   1   37 of 69    0 of 0  0%  0   0   0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                            32 of 69    37 of 69    46% 46% 23 of 54    19 of 46    2 of 7  16 of 20    7 of 8  2 of 3  31 of 68    32 of 61    1 of 1  2 of 2  0 of 0  3 of 6  0   40 of 91    1 of 1  100%    0   0   0   0   37 of 71    0 of 0  0%  0   0   0   40 of 91    37 of 71    43% 43% 28 of 77    24 of 53    6 of 7  12 of 17    6 of 7  1 of 1  39 of 90    36 of 70    1 of 1  1 of 1  0 of 0  0 of 0  0   53 of 124   0 of 1  0%  0   0   0   0   44 of 73    0 of 0  0%  0   0   0   53 of 124   44 of 73    42% 42% 45 of 113   24 of 49    3 of 6  18 of 21    5 of 5  2 of 3  48 of 118   42 of 71    5 of 6  2 of 2  0 of 0  0 of 0                                                                                                                                                                                                                      
                                            UFC Heavyweight Title Bout  Daniel Cormier,Stipe Miocic KO/TKO  Punches to Head At Distance 4   04:09   5 Rnd (5-5-5-5-5)   Herb Dean   0   1   181 of 263  123 of 229  68% 53% 230 of 317  135 of 244  1 of 3  1 of 3  33% 33% 0   0   2   0   0   0   0   71 of 83    1 of 2  50% 0   2   0   0   9 of 18 0 of 0  0%  0   0   0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                            37 of 46    7 of 13 80% 80% 25 of 34    3 of 8  7 of 7  0 of 0  5 of 5  4 of 5  13 of 16    6 of 12 3 of 3  0 of 0  21 of 27    1 of 1  0   59 of 85    0 of 0  0%  0   0   0   0   48 of 84    0 of 0  0%  0   0   0   56 of 82    46 of 82    68% 68% 56 of 81    37 of 72    0 of 0  8 of 9  0 of 1  1 of 1  45 of 68    42 of 76    11 of 14    4 of 6  0 of 0  0 of 0  0   69 of 100   0 of 1  0%  0   0   0   0   40 of 73    1 of 3  33% 0   0   0   57 of 86    34 of 67    66% 66% 53 of 82    28 of 61    1 of 1  5 of 5  3 of 3  1 of 1  50 of 76    24 of 50    7 of 10 10 of 17    0 of 0  0 of 0  0   31 of 49    0 of 0  0%  0   0   0   1   38 of 69    0 of 0  0%  0   0   0   31 of 49    36 of 67    63% 63% 28 of 46    18 of 47    1 of 1  14 of 16    2 of 2  4 of 4  31 of 49    30 of 57    0 of 0  5 of 5  0 of 0  1 of 5

First, the items from the first and second parse methods appear on separate rows. The items from the second parse method form a separate block, completely to the right of and below the items from the first parse method.

Second, within the second parse method's items (below and to the right of the first block of rows), the items skip a row to accommodate round-by-round data from an if-elif-else condition; this data is slotted between those rows. I am using items and item loaders, but I am not currently using any custom item pipelines. I run the spider from the command line and output to CSV with:

 scrapy crawl stats -o stats.csv

Abbreviated code:

class StatsSpider(scrapy.Spider):
    name = 'stats'
    allowed_domains = ['ufcstats.com']
    start_urls = ['http://ufcstats.com/statistics/events/completed?page=all']
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    }
    #ITEM_PIPELINES = {'stats.pipelines.StatsPipeline': 300,}
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': [ *extensive feed_export_fields* ],
    }

    def parse(self, response):
        rev_orderd_events = response.css('tr.b-statistics__table-row')
        # full event_links
        # event_links = rev_orderd_events.css('i>a::attr(href)').extract()
        # for url in event_links:
        #     yield scrapy.Request(url=event_links, callback=self.parse_event)
        event_links = rev_orderd_events.css('i>a::attr(href)')[3].extract()
        # for links in event_links:
        #     yield scrapy.Request(url=links,callback=self.parse_event)
        yield scrapy.Request(url=event_links, callback=self.parse_event, dont_filter=True)

    def parse_event(self, response):
        pg = response.css('div.l-page__container')
        for event in response.css('div.b-fight-details'):
            event_name = pg.css('h2.b-content__title>span::text').extract_first()
            event_date = event.css('ul.b-list__box-list>li:nth-child(1)::text').extract()
            event_loc  = event.css('ul.b-list__box-list>li:nth-child(2)::text').extract()
            attendance = event.css('ul.b-list__box-list>li:nth-child(3)::text').extract()
            child(odd)::text').extract()
            for fights in event.css('tr')[1:]:
                il = ItemLoader(StatsItem(), selector=fights)
                il.add_value('event_name', event_name)
                il.add_value('event_date', event_date)
                il.add_value('event_loc', event_loc)
                il.add_value('attendance', attendance)
                il.add_css('winner', 'td.b-fight-details__table-col:nth-child(2) p.b-fight-details__table-text:nth-child(odd)>a::text')
                il.add_css('loser', 'td.b-fight-details__table-col:nth-child(2) p.b-fight-details__table-text:nth-child(even)>a::text')
                il.add_css('wclass', 'td.b-fight-details__table-col:nth-child(7)>p:nth-child(1)::text')
                il.add_css('method', 'td.b-fight-details__table-col:nth-child(8)>p:nth-child(odd)::text')
                il.add_css('mthdtl', 'td.b-fight-details__table-col:nth-child(8)>p:nth-child(even)::text')
                il.add_css('finround', 'td.b-fight-details__table-col:nth-child(9)>p:nth-child(odd)::text')
                il.add_css('fintime', 'td.b-fight-details__table-col:nth-child(10)>p:nth-child(odd)::text')
                yield il.load_item()

        match_links = pg.css('tr>td:nth-child(1) a::attr(href)').extract()
        for links in match_links:
            yield scrapy.Request(url=links, callback=self.parse_match)

    def parse_match(self, response):
        section = response.css('section.b-statistics__section_details')
        f_dtl = section.css('div.b-fight-details')
        # m_event = section.css('h2>a::text').extract()
        m_info = f_dtl.css('div.b-fight-details__fight div i::text').extract()
        m_fin_dtl = f_dtl.css('div.b-fight-details__content>p::text').extract()
        ref = f_dtl.css('div.b-fight-details__content i>span::text').extract()
        #table_rows  = f_dtl.css('tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()
        #timefrmt = f_dtl.css('div.b-fight-details__fight div i::text')[15].extract()
        fighters = f_dtl.css('table:nth-child(1) tr.b-fight-details__table-row>td.b-fight-details__table-col>p>a::text').extract()
        m_totals = f_dtl.css('table:nth-child(1) tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()
        rounds = f_dtl.css('table:nth-child(2) tr.b-fight-details__table-row>td.b-fight-details__table-col>p::text').extract()

        for info in section:
            il = ItemLoader(StatsItem(), selector=section)
            il.add_value('bout', m_info)
            il.add_value('method_txt', m_info)
            il.add_value('mthdtl_txt', m_fin_dtl)
            il.add_value('m_finround', m_info)
            il.add_value('m_fintime', m_info)
            il.add_value('timefrmt', m_info)
            il.add_value('ref', ref)
            il.add_value('fighters', fighters)

            il.add_value('w_kd', m_totals)
            il.add_value('w_sigstr', m_totals)
            il.add_value('w_sigstr_perc', m_totals)
            il.add_value('w_tot_str', m_totals)
            il.add_value('w_td', m_totals)
            il.add_value('w_td_perc', m_totals)
            il.add_value('w_sub_att', m_totals)
            il.add_value('w_pass', m_totals)
            il.add_value('w_rev', m_totals)
            il.add_value('l_kd', m_totals)
            il.add_value('l_sigstr', m_totals)
            il.add_value('l_sigstr_perc', m_totals)
            il.add_value('l_tot_str', m_totals)
            il.add_value('l_td', m_totals)
            il.add_value('l_td_perc', m_totals)
            il.add_value('l_sub_att', m_totals)
            il.add_value('l_pass', m_totals)
            il.add_value('l_rev', m_totals)

            il.add_value('r1_w_kd', rounds)
            # il.add_value('r1_w_sigstr',  rounds)
            # il.add_value('r1_w_sigstr_perc',  rounds)
            il.add_value('r1_w_tot_str', rounds)
            il.add_value('r1_w_td', rounds)
            il.add_value('r1_w_td_perc', rounds)
            il.add_value('r1_w_sub_att', rounds)
            il.add_value('r1_w_pass', rounds)
            il.add_value('r1_w_rev', rounds)
            il.add_value('r1_l_kd', rounds)
            # il.add_value('r1_l_sigstr',  rounds)
            # il.add_value('r1_l_sigstr_perc',  rounds)
            il.add_value('r1_l_tot_str', rounds)
            il.add_value('r1_l_td', rounds)
            il.add_value('r1_l_td_perc', rounds)
            il.add_value('r1_l_sub_att', rounds)
            il.add_value('r1_l_pass', rounds)
            il.add_value('r1_l_rev', rounds)
            yield il.load_item()

            if len(rounds) == 42:
                r1 = ItemLoader(round_1_items(), selector=section)
                r1...
                yield r1.load_item()

            elif len(rounds) == 84:
                r2 = ItemLoader(round_2_items(), selector=section)
                r2...
                yield r2.load_item()

            elif len(rounds) == 126:
                r3 = ItemLoader(round_3_items(), selector=section)
                r3...
                yield r3.load_item()

            elif len(rounds) == 168:
                r4 = ItemLoader(round_4_items(), selector=section)
                r4...
                yield r4.load_item()

            elif len(rounds) == 210:
                r5 = ItemLoader(round_5_items(), selector=section)
                r5....
                yield r5.load_item()

            else:
                il = ItemLoader(StatsItem(), selector=section)
                il.add_value('rounders', rounds)
                yield il.load_item()

I would like each item to be output as one CSV row. So if the current CSV output looks like:

1 (block of rows)
  2a
    2b (alternating total/round detail rows)

I want my csv to be:

1 - 2a - 2b...

Upvotes: 0

Views: 271

Answers (1)

tomjn

Reputation: 5389

It took me a while to understand your question/problem, so apologies if my answer is not correct.

Scrapy writes a new line to the output each time you yield an item, so you should only yield once you have a complete StatsItem. If your data must be parsed from two different pages, you can create your item in parse_event and then pass it, partially filled, through to parse_match using either cb_kwargs (introduced in Scrapy 1.7) or the meta argument of Request.

So in parse_event you'd have

yield scrapy.Request(..., callback=self.parse_match, 
                     cb_kwargs={'item': il.load_item()})

and then you can modify parse_match to take item as an argument

def parse_match(self, response, item):
    ...
    # Later on
    il = ItemLoader(item, selector=section)
    # Fill rest of item
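
The round-by-round data can go into that same item rather than into separate round_N items. Here is a minimal sketch in plain Python, assuming (as the length checks in the question suggest) that each round contributes 42 cells to rounds. The helper names split_rounds and round_fields are hypothetical, and the suffix order is an assumption that would have to match the real cell order on the page:

```python
def split_rounds(values, per_round=42):
    """Split a flat list of scraped cells into one chunk per round."""
    if len(values) % per_round != 0:
        raise ValueError(
            f"expected a multiple of {per_round} values, got {len(values)}"
        )
    return [values[i:i + per_round] for i in range(0, len(values), per_round)]


def round_fields(values, suffixes, per_round=42):
    """Map each round's chunk onto field names such as 'r1_w_kd', 'r2_w_kd'."""
    fields = {}
    for number, chunk in enumerate(split_rounds(values, per_round), start=1):
        # zip stops at the shorter sequence, so unnamed trailing cells are skipped
        for suffix, value in zip(suffixes, chunk):
            fields[f"r{number}_{suffix}"] = value
    return fields


# Toy data: two rounds of three cells each (the question implies 42 per round).
demo = round_fields(
    ["0", "23 of 41", "0 of 0", "1", "40 of 70", "0 of 0"],
    suffixes=["w_kd", "w_tot_str", "w_td"],  # hypothetical order
    per_round=3,
)
print(demo)
```

The resulting dict could then be loaded into the same il before the single yield, for example with il.add_value(None, fields), which as far as I know accepts a dict of several fields at once. That would replace the if-elif-else chain that yields a separate item per round count.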

In conclusion, try to yield il.load_item() only once per row, after the item is complete.
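
To see why a single yield gives a single row, here is a toy sketch that leaves Scrapy out entirely: the callbacks are plain generators, a tuple stands in for a Request, and the crawl function is a hypothetical stand-in for the engine. All event and fighter data is made up for illustration:

```python
import csv
import io


def parse_event(event):
    """First callback: start one partial item per fight and hand it on."""
    for fight in event["fights"]:
        partial = {"event_name": event["name"], "winner": fight["winner"]}
        # Stand-in for scrapy.Request(..., callback=..., cb_kwargs={'item': ...})
        yield parse_match, fight["match_page"], {"item": partial}


def parse_match(match_page, item):
    """Second callback: finish the item, then yield it exactly once."""
    item = dict(item)  # copy so the caller's dict is left untouched
    item["ref"] = match_page["ref"]
    item["w_kd"] = match_page["w_kd"]
    yield item


def crawl(event):
    """Tiny stand-in for the engine: follow each 'request', collect items."""
    rows = []
    for callback, page, kwargs in parse_event(event):
        rows.extend(callback(page, **kwargs))
    return rows


event = {
    "name": "Example Event",
    "fights": [
        {"winner": "Fighter A", "match_page": {"ref": "Ref X", "w_kd": "1"}},
        {"winner": "Fighter B", "match_page": {"ref": "Ref Y", "w_kd": "0"}},
    ],
}
rows = crawl(event)

# Each yielded item becomes exactly one CSV row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["event_name", "winner", "ref", "w_kd"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because parse_event yields no item of its own, the event-level columns and the match-level columns end up side by side in one row per fight.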

Upvotes: 1
