Reputation: 1222
I've hit a brick wall with my Scrapy scrape of an HTML table. The code below first collects the column names and uses them as keys in a dict that maps each name to its column index, then builds one dict per table row by pairing each key with the corresponding XPath entry. The row dicts are put into a pandas DataFrame, which is ultimately written out as a CSV for final use.
import scrapy
from scrapy.selector import Selector
import re
import pandas as pd


class PostSpider(scrapy.Spider):
    name = "standard_squads"
    start_urls = [
        "https://fbref.com/en/comps/11/stats/Serie-A-Stats",
    ]

    def parse(self, response):
        column_index = 1
        columns = {}
        for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
            column_name = column_node.xpath("./text()").extract_first()
            print("column name is: " + column_name)
            columns[column_name] = column_index
            column_index += 1

        matches = []
        for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
            match = {}
            for column_name in columns.keys():
                if column_name == 'Squad':
                    match[column_name] = row.xpath('th/a/text()').extract_first()
                else:
                    match[column_name] = row.xpath(
                        "./td[{index}]//text()".format(index=columns[column_name] - 1)
                    ).extract_first()
            matches.append(match)

        print(matches)
        df = pd.DataFrame(matches, columns=columns.keys())
        yield df.to_csv("test_squads.csv", sep=",", index=False)
However, I just realised that the column header names in the XPath response (//*[@id="stats_standard_squads"]/thead/tr[2]/th) actually contain duplicates: for example, xG appears twice in the table, as does xA. Because of this, the duplicates are tossed away when I loop through columns.keys(), so I only end up with 20 columns in the final CSV instead of 25.
I'm not sure what to do now. I've tried adding the column names to a list, setting them as DataFrame headers and then appending a new row each time, but it seems to involve a lot of boilerplate. Is there a simpler solution for this automated scrape that allows for duplicate names in the pandas DataFrame columns?
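For what it's worth, the collision doesn't come from pandas itself but from the per-row dict: a dict can hold each key only once, so the second "xG" silently overwrites the first. pandas is happy to carry duplicate column labels if the rows are built as lists. A minimal sketch (the sample values are made up):

```python
import pandas as pd

# A dict can only hold each key once, so the second "xG" overwrites the first:
row = {"Squad": "Inter", "xG": 1.2}
row["xG"] = 3.4  # only one "xG" entry survives

# pandas itself accepts duplicate column labels when rows are lists, not dicts:
df = pd.DataFrame([["Inter", 1.2, 3.4]], columns=["Squad", "xG", "xG"])
print(df.columns.tolist())  # ['Squad', 'xG', 'xG']
```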
Upvotes: 0
Views: 228
Reputation: 10666
What about reading the column names into a list and adding suffixes to the duplicates:
def parse(self, response):
    columns = []
    for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
        column_name = column_node.xpath("./text()").extract_first()
        columns.append(column_name)

    matches = []
    for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
        match = {}
        suffixes = {}
        for column_index, column_name in enumerate(columns):
            # Add a numeric suffix to repeated column names
            if column_name not in suffixes:
                suffixes[column_name] = 1
                df_name = column_name  # no suffix for the first occurrence
            else:
                suffixes[column_name] += 1
                df_name = f'{column_name}_{suffixes[column_name]}'
            if column_name == 'Squad':
                match[df_name] = row.xpath('th/a/text()').extract_first()
            else:
                match[df_name] = row.xpath(
                    "./td[{index}]//text()".format(index=column_index)
                ).extract_first()
        matches.append(match)

    print(matches)
    df = pd.DataFrame(matches)  # columns come from the (now unique) dict keys
    yield df.to_csv("test_squads.csv", sep=",", index=False)
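The suffixing step can also be pulled out into a small standalone helper and checked on its own. This is the same scheme as above (first occurrence keeps its name, later ones get _2, _3, …); the name add_suffixes is just for illustration:

```python
def add_suffixes(names):
    """Rename duplicates: first occurrence unchanged, repeats get _2, _3, ..."""
    seen = {}
    result = []
    for name in names:
        if name not in seen:
            seen[name] = 1
            result.append(name)
        else:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
    return result

print(add_suffixes(["Squad", "xG", "xA", "xG", "xA"]))
# ['Squad', 'xG', 'xA', 'xG_2', 'xA_2']
```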
Upvotes: 1