Reputation: 121
I'm working on a web scraping project with nba stats. When I am scraping, I can get all of the information. However, all of the stats are returning as one string, which, turned into a dataframe, puts all the stats in one column. I'm attempting to split this string. and replace it in it's own nested area. Hopefully the image explains this better.
I am webscraping from https://stats.nba.com/players/traditional/?sort=PTS&dir=-1 using selenium because I am planning on clicking though all of the pages
here is the function I'm working on: In the last line I would like to replace z[2] with the split version I've created. When I try z[2] = z[2].split(' ') I get the error AttributeError: 'list' object has no attribute 'split'
new_split = []
for i in player:
player_stats.append(i.text.split('\n'))
for z in player_stats:
new_split.append(z[2].split(' '))```
Upvotes: 0
Views: 50
Reputation: 10809
You didn't mention where you're getting your data from. (I've updated the url
in my code. It's still the same API, which returns information for all 457 players, so there is no need to use selenium to navigate to the other pages). The official nba website seems to be offering their data in JSON format, which is always desirable when web scraping:
import requests
import json
# url = "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2019-20&SeasonType=Regular+Season&StatCategory=PTS"
url = "https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=&Weight="
response = requests.get(url)
response.raise_for_status()
data = json.loads(response.text)
players = []
for player_data in data["resultSet"]["rowSet"]:
player = dict(zip(data["resultSet"]["headers"], player_data))
players.append(player)
for player in players[:10]:
print(f"{player['PLAYER']} ({player['TEAM_ABBREVIATION']}) is rank {player['RANK']} with a GP of {player['GP']}")
Output:
James Harden (HOU) is rank 1 with a GP of 18
Giannis Antetokounmpo (MIL) is rank 2 with a GP of 19
Luka Doncic (DAL) is rank 3 with a GP of 18
Bradley Beal (WAS) is rank 4 with a GP of 17
Trae Young (ATL) is rank 5 with a GP of 18
Damian Lillard (POR) is rank 6 with a GP of 18
Karl-Anthony Towns (MIN) is rank 7 with a GP of 16
Anthony Davis (LAL) is rank 8 with a GP of 18
Brandon Ingram (NOP) is rank 9 with a GP of 15
LeBron James (LAL) is rank 10 with a GP of 19
Note: I have no idea what a "GP" is - I just picked that for demonstration. Here's a screenshot of Chrome's network logger, showing a small part of the expanded JSON resource (EDIT The json response from the new url
looks exactly the same, except some of the headers are different, like "TEAM" -> "TEAM_ABBREVIATION"):
You can see the values - which you're struggling to extract out of one giant string - nicely separated into separate elements. The code I posted above creates key-value pairs using the headers ("PLAYER_ID", "RANK", etc. found in data["resultSet"]["headers"]
) and these values.
Upvotes: 1
Reputation: 143
If the second column is a string, you could try to split this string into a list, turn each element of this list into a series, and then concat this new data frame with the first two columns of the original data frame.
df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)
df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)
Example:
df = pd.DataFrame({"0": [1, 2],
"1": ["Name1", "Name2"],
"2":[["HOU 30 80"], ["LA 30 50"]]})
df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)
df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)
0 1 0 1 2
0 1 Name1 HOU 30 80
1 2 Name2 LA 30 50
Upvotes: 0