RiffRaffCat

Reputation: 1041

Scraping data with Python, but the CSV file doesn't show the data in the correct format

I am working on a small data scraping project because I want to do some data analysis. The data comes from foxsports; the URLs are included in the code, and the steps are explained in the comments. If possible, you can just paste and run it.

I want to go through the web pages for the 2013-2018 seasons and scrape all the data in the table on each page. Here is my code:

import requests
from lxml import html
import csv

# Set up the urls for Bayern Muenchen's Team Stats starting from the 2013-14 Season
# up to the 2017-18 Season
# The data is stored on the foxsports website
urls = ["https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2013&category=STANDARD", 
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2014&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2015&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2016&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2017&category=STANDARD"
]

seasons = ["2013/2014","2014/2015", "2015/2016", "2016/2017", "2017/2018"]

data = ["Season", "Team", "Name", "Games_Played", "Games_Started", "Minutes_Played", "Goals", "Assists", "Shots_On_Goal", "Shots", "Yellow_Cards", "Red_Cards"]

csvFile = "bayern_munich_team_stats_2013_18.csv"
# Having set up the dataframe and urls for various season standard stats, we
# are going to examine the xpath of the same player Lewandowski's same data feature
# for various pages (namely the different season pages)
# See if we can find some pattern

# 2017-18 Season Name xpath:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# 2016-17 Season Name xpath:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# 2015-16 Season Name xpath:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]

# tr xpath 17-18:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
# tr xpath 16-17:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
# tr xpath 15-16:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]

# For a single season's team stats, the tbody and tr relationship is like:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[2]

# lewandowski
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# Wagner
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[2]/td[1]/div/a/span[1]
# ********
# for each row with player names, the name proceeds with tr[num], num += 1 gives
# new name in a new row.
# ********


i = 0
for url in urls:
    print(url)
    response = requests.get(url)
    result = html.fromstring(response.content)
    j = 1
    for tr in result.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr'):
        # Except for season and team, we open foxsports webpage for the given team, here
        # Bayern Munich, and the given season, here starting from 13-14, and use F12 to
        # view page elements, look for tbody of the figure table, then copy the corresponding
        # xpath to here. Adjust the xpath as described above.

        season = seasons[i] # seasons[i] changes with i, but stays the same for each season
        data.append(season)
        team = ["FC BAYERN MUNICH"] # this doesn't change since we are extracting solely Bayern
        data.append(team)
        name =  tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[1]/div/a/span[1]' %j )
        data.append(name)
        gamep = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[2]' %j )
        data.append(gamep)
        games = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[3]' %j )
        data.append(games)
        mp =    tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[4]' %j )
        data.append(mp)
        goals = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[5]' %j )
        data.append(goals)
        assists = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[6]' %j )
        data.append(assists)
        shots_on_goal = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[7]' %j )
        data.append(shots_on_goal)
        shots = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[8]' %j )
        data.append(shots)
        yellow = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[9]' %j )
        data.append(yellow)
        red=    tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[10]' %j )
        data.append(red)
        # update j for next row of player
        j += 1
    # update i
    i += 1


with open(csvFile, "w") as file:
    writer = csv.writer(file)
    writer.writerow(data)

print("Done")

I tried to use data.extend([season, name, team, ...]) but the result is still the same, so I just appended everything here. The CSV file content is not what I expected.

I am not quite sure what went wrong. The output shows "Element span at XXXXXX#####", and I am still new to programming. I'd really appreciate it if anyone could help me with this issue, so I can keep going with this little project, which is for educational purposes only. Thank you very much for your time and help!

Upvotes: 1

Views: 219

Answers (1)

Nihal

Reputation: 5344

This is what you can do. I have done the same before, like this:

import csv

# output_file, data2, name_person and status come from your own scraping code
with open(output_file, 'w', newline='') as csvfile:
    # the dict keys passed to writerow must match these field names exactly
    field_names = ['profile', 'profile1', 'Name', 'job_type', 'Status']
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    # first row: the header
    writer.writerow(
        {'profile': 'profile', 'profile1': 'profile1',
         'Name': 'Name', 'job_type': 'Job Type', 'Status': 'Status'})

    for raw in data2:
        data = []
        # get your data using selenium
        # data.append()
        writer.writerow(
            {'profile': data[0], 'profile1': data[1],
             'Name': name_person, 'job_type': data[2], 'Status': status})

Here the first writer.writerow call writes your header, and field_names are just the keys used to put your data into a particular column.
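
For reference, here is a minimal self-contained sketch of the same header-then-rows pattern, using `writeheader()` instead of writing the header dict by hand. The column names and the two sample rows are made up for illustration, and it writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

# Hypothetical column names and rows, only for illustration.
field_names = ["Season", "Name", "Goals"]
rows = [
    {"Season": "2017/2018", "Name": "Player A", "Goals": 29},
    {"Season": "2017/2018", "Name": "Player B", "Goals": 8},
]

buffer = io.StringIO()  # stands in for open(path, "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=field_names)
writer.writeheader()      # header row built from fieldnames
for row in rows:
    writer.writerow(row)  # keys must match fieldnames exactly

csv_text = buffer.getvalue()
print(csv_text)
```

If a row dict contains a key that is not in `fieldnames`, `DictWriter` raises a `ValueError` by default, so it is worth keeping the two in sync.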

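xpath() returns a list of elements, which is why you see [<Element td at 0x151ca980638>]. To get the value, take the element out of the list and use its text, for example data.append(name[0].text)

You can also do this directly by indexing the xpath result and adding .text:

name =  tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[1]/div/a/span[1]' %j )[0].text
data.append(name)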
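
Here is a small sketch of how this could apply to the loop in the question, under a few assumptions: the HTML is a made-up stand-in for the real table (the actual foxsports markup may differ), and the column list is shortened. It shows that `xpath()` returns a list, that a path starting with `./` is evaluated relative to the current `tr`, and that collecting one list per player lets `writerows` put each player on a separate line.

```python
import csv
import io
from lxml import html

# Made-up stand-in for the real page; the actual markup may differ.
page = """
<html><body>
<table>
  <tbody>
    <tr><td><div><a><span>Player A</span></a></div></td><td>30</td><td>29</td></tr>
    <tr><td><div><a><span>Player B</span></a></div></td><td>12</td><td>8</td></tr>
  </tbody>
</table>
</body></html>
"""

tree = html.fromstring(page)
header = ["Season", "Name", "Games_Played", "Goals"]
rows = []

for tr in tree.xpath("//table/tbody/tr"):
    # xpath() always returns a list, so take the first match before .text
    name = tr.xpath("./td[1]/div/a/span[1]")[0].text
    # text_content() gathers the text of a cell and all of its children
    cells = [td.text_content().strip() for td in tr.xpath("./td[position() > 1]")]
    rows.append(["2017/2018", name] + cells)  # one list per player

buffer = io.StringIO()  # stands in for open(csvFile, "w", newline="")
writer = csv.writer(buffer)
writer.writerow(header)  # a single header row
writer.writerows(rows)   # one CSV row per player
print(buffer.getvalue())
```

Appending everything to one flat list and calling `writerow` once, as in the question, produces a single very long row; building a list of rows avoids that.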

Upvotes: 1
