lydol

Reputation: 121

Nested string in a list - need to split nested string to help turn it into a dataframe

I'm working on a web scraping project with NBA stats. When I scrape, I can get all of the information. However, all of the stats come back as one string, which, when turned into a dataframe, puts all the stats in one column. I'm attempting to split this string and replace it in its own nested list. Hopefully the image explains this better.

I am web scraping from https://stats.nba.com/players/traditional/?sort=PTS&dir=-1 using Selenium because I am planning on clicking through all of the pages.

code I've done so far

Here is the function I'm working on. In the last line I would like to replace z[2] with the split version I've created. When I try z[2] = z[2].split(' ') I get the error AttributeError: 'list' object has no attribute 'split'

new_split = []
for i in player:
    player_stats.append(i.text.split('\n'))
    for z in player_stats:
        new_split.append(z[2].split(' '))
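
A note on the error, as a minimal sketch: because the inner loop sits inside the outer one, it walks over every row collected so far each time a new player is appended. With z[2] = z[2].split(' '), the first row is converted to a list on the first pass and visited again on the second, where z[2] is already a list and has no split method. Splitting each row exactly once, as it is read, avoids this. The rows below are hypothetical stand-ins for i.text, and the stats are flattened into the row rather than nested, on the assumption that a flat row is what a dataframe needs.

```python
# Hypothetical stand-ins for the text of each scraped table row:
# rank, player name, then all stats in one space-separated string.
raw_rows = [
    "1\nJames Harden\nHOU 30 18",
    "2\nGiannis Antetokounmpo\nMIL 24 19",
]

player_stats = []
for row_text in raw_rows:
    fields = row_text.split("\n")      # e.g. ["1", "James Harden", "HOU 30 18"]
    # Replace the single stats string with its individual parts (one pass per row).
    fields[2:] = fields[2].split(" ")
    player_stats.append(fields)

print(player_stats[0])
```

If the nested shape is really wanted, fields[2] = fields[2].split(" ") in the same single loop keeps the stats as an inner list.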

Upvotes: 0

Views: 50

Answers (2)

Paul M.

Reputation: 10809

You didn't mention where you're getting your data from. (I've updated the URL in my code. It's still the same API, which returns information for all 457 players, so there is no need to use Selenium to navigate to the other pages.) The official NBA website seems to offer its data in JSON format, which is always desirable when web scraping:

import requests
import json

# url = "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2019-20&SeasonType=Regular+Season&StatCategory=PTS"

url = "https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=&Weight="

response = requests.get(url)
response.raise_for_status()

data = json.loads(response.text)

players = []
for player_data in data["resultSet"]["rowSet"]:
    player = dict(zip(data["resultSet"]["headers"], player_data))
    players.append(player)


for player in players[:10]:
    print(f"{player['PLAYER']} ({player['TEAM_ABBREVIATION']}) is rank {player['RANK']} with a GP of {player['GP']}")

Output:

James Harden (HOU) is rank 1 with a GP of 18
Giannis Antetokounmpo (MIL) is rank 2 with a GP of 19
Luka Doncic (DAL) is rank 3 with a GP of 18
Bradley Beal (WAS) is rank 4 with a GP of 17
Trae Young (ATL) is rank 5 with a GP of 18
Damian Lillard (POR) is rank 6 with a GP of 18
Karl-Anthony Towns (MIN) is rank 7 with a GP of 16
Anthony Davis (LAL) is rank 8 with a GP of 18
Brandon Ingram (NOP) is rank 9 with a GP of 15
LeBron James (LAL) is rank 10 with a GP of 19

Note: I have no idea what a "GP" is - I just picked that for demonstration. Here's a screenshot of Chrome's network logger, showing a small part of the expanded JSON resource (EDIT: The JSON response from the new URL looks exactly the same, except some of the headers are different, like "TEAM" -> "TEAM_ABBREVIATION"):

You can see the values - which you're struggling to extract out of one giant string - nicely separated into separate elements. The code I posted above creates key-value pairs using the headers ("PLAYER_ID", "RANK", etc. found in data["resultSet"]["headers"]) and these values.
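
Since the end goal is a dataframe, the same headers and rows can also be handed to pandas directly. This is a minimal sketch, not tested against the live API: the data dict below is a hypothetical, heavily trimmed payload that only mimics the shape described above (a "resultSet" with "headers" and "rowSet"), with values taken from the sample output.

```python
import pandas as pd

# Hypothetical, trimmed payload with the same shape as the response described
# above; the real response has many more headers and rows.
data = {
    "resultSet": {
        "headers": ["PLAYER", "TEAM_ABBREVIATION", "RANK", "GP"],
        "rowSet": [
            ["James Harden", "HOU", 1, 18],
            ["Giannis Antetokounmpo", "MIL", 2, 19],
        ],
    }
}

result = data["resultSet"]
# Each inner list in rowSet becomes one row; headers become the column names.
df = pd.DataFrame(result["rowSet"], columns=result["headers"])
print(df)
```

With the real response, data would come from the request above instead of the hard-coded dict.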

Upvotes: 1

c0rias

Reputation: 143

If column "2" holds a one-element list containing the stats string, you could split that string into a list, turn each element of the list into a Series, and then concatenate the resulting data frame with the first two columns of the original data frame.

df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)

df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)

Example:

import pandas as pd

df = pd.DataFrame({"0": [1, 2],
                   "1": ["Name1", "Name2"],
                   "2":[["HOU 30 80"], ["LA 30 50"]]})

df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)

df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)

    0   1       0   1   2
0   1   Name1   HOU 30  80
1   2   Name2   LA  30  50
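
An alternative sketch of the same idea, using the string accessor's split with expand=True instead of apply(pd.Series), which also makes it easy to give the new columns distinct names so they don't collide with "0" and "1". The labels "team", "stat_a" and "stat_b" are hypothetical placeholders, and the split values stay as strings.

```python
import pandas as pd

df = pd.DataFrame({"0": [1, 2],
                   "1": ["Name1", "Name2"],
                   "2": [["HOU 30 80"], ["LA 30 50"]]})

# Unwrap the one-element list, then split the string into separate columns.
stats = df["2"].map(lambda cell: cell[0]).str.split(" ", expand=True)
# Hypothetical labels; substitute the real stat names.
stats.columns = ["team", "stat_a", "stat_b"]

df_end = pd.concat([df[["0", "1"]], stats], axis=1)
print(df_end)
```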

Upvotes: 0
