clattenburg cake

Reputation: 1222

Scrapy HTML Table to DataFrame With Duplicated Headers

I've hit a brick wall with my Scrapy scrape of an HTML table. Basically, my code first reads the column names from the header row and uses them as dictionary keys, then builds one dictionary per table row by pairing each key with the corresponding XPath entry. These row dictionaries are put into a pandas DataFrame, which is ultimately written out as a CSV for final use.

import scrapy
from scrapy.selector import Selector
import re
import pandas as pd

class PostSpider(scrapy.Spider):

    name = "standard_squads"

    start_urls = [
        "https://fbref.com/en/comps/11/stats/Serie-A-Stats",
    ]

    def parse(self, response):

        column_index = 1
        columns = {}
        for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
            column_name = column_node.xpath("./text()").extract_first()
            print("column name is: " + column_name)
            columns[column_name] = column_index
            column_index += 1

        matches = []

        for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
            match = {}
            for column_name in columns.keys():

                if column_name=='Squad':
                    match[column_name]=row.xpath('th/a/text()').extract_first()
                else:
                    match[column_name] = row.xpath(
                        "./td[{index}]//text()".format(index=columns[column_name]-1)
                    ).extract_first()

            matches.append(match)
        
        print(matches)

        df = pd.DataFrame(matches,columns=columns.keys())

        yield df.to_csv("test_squads.csv",sep=",", index=False)

However, I've just realised that the column header names in the XPath response (//*[@id="stats_standard_squads"]/thead/tr[2]/th) actually contain duplicates (for example, xG appears twice in the table, as does xA). Because a dict can only hold each key once, the duplicates get tossed away when I loop through columns.keys(), so I only end up with 20 columns in the final CSV instead of 25.
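
A stripped-down illustration of what I mean (made-up header names, outside the spider): the dict keeps only one entry per key, so the repeated headers overwrite each other, whereas a plain list keeps them all.

headers = ["Squad", "xG", "xA", "xG", "xA"]

# dict keys must be unique, so the duplicated names collapse
as_dict = {name: index for index, name in enumerate(headers, start=1)}
print(list(as_dict.keys()))   # ['Squad', 'xG', 'xA']

# a list preserves every header, duplicates included
print(headers)                # ['Squad', 'xG', 'xA', 'xG', 'xA']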

I'm not sure what to do now. I've tried adding the column names to a list, setting them as the DataFrame headers and then appending a new row each time, but it ends up being a lot of boilerplate. I was hoping there might be a simpler solution for this automated scrape that allows for duplicate column names in a pandas DataFrame?
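
From a quick check outside Scrapy, pandas itself does seem to accept duplicate column labels when the rows are passed as lists rather than dicts (rough sketch with made-up values below), so the sticking point is really the dict-keyed rows:

import pandas as pd

# made-up values; duplicate labels survive when rows are lists, not dicts
rows = [["Juventus", 1.2, 0.8, 1.1, 0.7]]
cols = ["Squad", "xG", "xA", "xG", "xA"]
df = pd.DataFrame(rows, columns=cols)
print(df.columns.tolist())    # ['Squad', 'xG', 'xA', 'xG', 'xA']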

Upvotes: 0

Views: 228

Answers (1)

gangabass

Reputation: 10666

What about reading the column names into a list (so duplicates are kept) and adding suffixes when building each row's dict:

def parse(self, response):
    columns = []
    for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
        column_name = column_node.xpath("./text()").extract_first()
        columns.append(column_name)            

    matches = []
    for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
        match = {}
        suffixes = {}
        for column_index, column_name in enumerate(columns):
            # Build a unique key for the current column:
            # the first occurrence keeps its name, repeats get _2, _3, ...
            if column_name not in suffixes:
                suffixes[column_name] = 1
                df_name = column_name  # no suffix for the first occurrence
            else:
                suffixes[column_name] += 1
                df_name = f'{column_name}_{suffixes[column_name]}'

            if column_name=='Squad':
                match[df_name]=row.xpath('th/a/text()').extract_first()
            else:
                match[df_name] = row.xpath(
                    "./td[{index}]//text()".format(index=column_index)
                ).extract_first()

        matches.append(match)
    
    print(matches)

    # The match dicts already carry the suffixed names, so let pandas
    # derive the columns from them (columns is a plain list here and
    # has no .keys() method).
    df = pd.DataFrame(matches)

    yield df.to_csv("test_squads.csv",sep=",", index=False)
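
Run on its own (outside the spider), the same suffixing idea gives column names like this; a minimal sketch with made-up headers:

# minimal standalone sketch of the suffixing idea, made-up headers
def add_suffixes(names):
    seen = {}
    result = []
    for name in names:
        if name not in seen:
            seen[name] = 1
            result.append(name)                    # first occurrence keeps its name
        else:
            seen[name] += 1
            result.append(f'{name}_{seen[name]}')  # repeats get _2, _3, ...
    return result

print(add_suffixes(["Squad", "xG", "xA", "xG", "xA"]))
# ['Squad', 'xG', 'xA', 'xG_2', 'xA_2']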

Upvotes: 1
