Reputation: 39
Right now when I run this, I get a final output that includes two sets of header columns, and as a result it won't let me write it to a .csv either. How would I fix this so that it only includes the columns from the first table? (The column names are the same throughout.)
import pandas as pd
import urllib.request
import bs4 as bs

urls = ['https://fantasysportsdaily.net/bsl/boxes/1-1.html',
        'https://fantasysportsdaily.net/bsl/boxes/1-2.html'
        ]

final = []
for url in urls:
    df = pd.read_html(url, header=0)
    format1 = df[1].iloc[:, : 16]
    colname1 = format1.columns[0]
    format1.insert(1, 'Team', colname1)
    format1.rename(columns = {list(format1)[0]: 'Player'}, inplace = True)
    format2 = format1.drop(format1[format1.Player == 'TEAM TOTALS'].index)
    team1 = format2.drop(format2[format2.Player == 'PERCENTAGES'].index)
    format3 = df[2].iloc[:, : 16]
    colname2 = format3.columns[0]
    format3.insert(1, 'Team', colname2)
    format3.rename(columns = {list(format3)[0]: 'Player'}, inplace = True)
    format4 = format3.drop(format3[format3.Player == 'TEAM TOTALS'].index)
    team2 = format4.drop(format4[format4.Player == 'PERCENTAGES'].index)
    both_teams = [team1, team2]
    combined = pd.concat(both_teams)
    final.append(combined, ignore_index=True)

print(final)
##final.to_csv ('boxes.csv', index = True, header=True)
Upvotes: 1
Views: 1401
Reputation: 11515
Please pay attention to the following points:

1. pd.read_html fetches the page with a new connection on each call, so it's better to reuse one session for the same host. That's why I've used requests.Session() and passed the fetched content to pandas.
2. Refactor the repeated logic into a function (or a class), as I've done within the code.
3. .iloc[] can actually drop columns and rows as well, so you don't need to go around in circles.

import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0'
}

def myformat(content, key):
    df = pd.read_html(content)[key].iloc[:-2, :-1]
    df.insert(1, 'Team', df.columns[0])
    df.rename(columns={df.columns[0]: "Player"}, inplace=True)
    return df

def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        allin = []
        for num in range(1, 3):
            r = req.get(url.format(num))
            df1 = myformat(r.content, 1)
            df2 = myformat(r.content, 2)
            final = pd.concat([df1, df2], ignore_index=True)
            allin.append(final)
        target = pd.concat(allin, ignore_index=True)
        print(target)

main('https://fantasysportsdaily.net/bsl/boxes/1-{}.html')
Output:
Player Team POS MIN FG FGA 3P ... REB A PF ST TO BL PTS
0 David Robinson Spurs C 32 6 14 0 ... 15 1.0 4.0 2.0 3.0 2.0 21.0
1 Reggie Miller Spurs SG 42 5 12 3 ... 6 3.0 0.0 2.0 3.0 0.0 17.0
2 Tom Gugliotta Spurs PF 25 6 7 0 ... 11 1.0 4.0 2.0 3.0 1.0 17.0
3 Allan Houston Spurs PG 27 5 19 2 ... 1 1.0 1.0 2.0 3.0 0.0 12.0
4 Sean Elliott Spurs SF 34 3 6 0 ... 4 0.0 3.0 2.0 2.0 0.0 7.0
5 Rik Smits Spurs PF 32 1 10 0 ... 9 1.0 4.0 0.0 1.0 0.0 6.0
6 Mark Jackson Spurs PG 21 3 9 0 ... 3 7.0 1.0 2.0 6.0 0.0 6.0
7 Will Perdue Spurs C 16 1 3 0 ... 1 1.0 1.0 1.0 0.0 1.0 4.0
8 Robert Pack Spurs SG 12 0 2 0 ... 0 1.0 1.0 0.0 0.0 0.0 0.0
9 John Starks Lakers SG 39 10 20 2 ... 7 1.0 2.0 2.0 4.0 0.0 27.0
10 Magic Johnson Lakers PG 36 7 10 1 ... 7 7.0 1.0 1.0 2.0 0.0 20.0
11 Eddie Jones Lakers SF 31 4 7 1 ... 5 3.0 0.0 2.0 2.0 0.0 12.0
12 Elden Campbell Lakers PF 24 5 10 0 ... 5 0.0 4.0 0.0 0.0 1.0 12.0
13 Cedric Ceballos Lakers PF 32 3 11 0 ... 11 3.0 6.0 4.0 7.0 0.0 10.0
14 Vlade Divac Lakers C 24 3 6 0 ... 9 1.0 5.0 1.0 1.0 1.0 6.0
15 Pervis Ellison Lakers C 18 3 4 0 ... 4 0.0 6.0 1.0 0.0 1.0 6.0
16 Nick Van Exel Lakers PG 17 3 7 0 ... 1 3.0 0.0 0.0 1.0 0.0 6.0
17 Corie Blount Lakers C 6 0 0 0 ... 4 0.0 1.0 1.0 1.0 1.0 4.0
18 Anthony Peeler Lakers SF 13 0 4 0 ... 1 1.0 0.0 0.0 2.0 0.0 0.0
19 Terry Porter Timberwolves PG 31 6 15 2 ... 4 1.0 2.0 1.0 4.0 0.0 16.0
20 Kendall Gill Timberwolves PG 26 6 10 1 ... 5 5.0 0.0 0.0 3.0 0.0 15.0
21 J.R. Rider Timberwolves SG 34 7 14 0 ... 5 4.0 4.0 0.0 6.0 1.0 14.0
22 Larry Johnson Timberwolves SF 31 3 13 0 ... 10 3.0 1.0 0.0 1.0 0.0 8.0
23 LaPhonso Ellis Timberwolves PF 30 1 13 0 ... 15 2.0 3.0 0.0 1.0 1.0 6.0
24 J.R. Reid Timberwolves PF 18 1 4 0 ... 3 0.0 1.0 2.0 3.0 0.0 4.0
25 Mark Davis Timberwolves SF 17 1 3 0 ... 3 0.0 0.0 0.0 1.0 0.0 2.0
26 Eric Riley Timberwolves C 13 1 2 0 ... 5 0.0 0.0 1.0 1.0 0.0 2.0
27 Kevin Garnett Timberwolves C 35 0 8 0 ... 9 2.0 2.0 2.0 0.0 2.0 1.0
28 Micheal Williams Timberwolves PG 5 0 2 0 ... 2 1.0 1.0 0.0 0.0 0.0 0.0
29 Jim McIlvaine Bullets C 30 5 8 0 ... 6 1.0 2.0 0.0 1.0 6.0 16.0
30 Ledell Eackles Bullets SG 30 6 9 2 ... 9 1.0 2.0 3.0 2.0 0.0 15.0
31 Juwan Howard Bullets PF 29 4 10 0 ... 6 1.0 2.0 1.0 1.0 0.0 15.0
32 Avery Johnson Bullets PG 35 6 16 0 ... 2 6.0 2.0 2.0 2.0 0.0 14.0
33 Tim Legler Bullets SF 28 5 13 0 ... 2 0.0 0.0 1.0 1.0 0.0 10.0
34 David Benoit Bullets C 18 2 8 1 ... 10 1.0 1.0 1.0 1.0 0.0 7.0
35 Brent Price Bullets SG 18 2 6 1 ... 2 2.0 1.0 1.0 0.0 0.0 5.0
36 Rasheed Wallace Bullets SF 22 2 6 0 ... 1 2.0 1.0 1.0 0.0 0.0 4.0
37 Cory Alexander Bullets PG 9 0 2 0 ... 2 2.0 0.0 0.0 3.0 0.0 1.0
38 Mitchell Butler Bullets PF 19 0 1 0 ... 6 0.0 1.0 0.0 0.0 0.0 0.0
[39 rows x 17 columns]
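As for the CSV part of the question: once everything is combined into a single DataFrame like target above, writing it out is one call. A minimal sketch with toy frames (the column names and data here are illustrative, not the real scraped box scores):

```python
import pandas as pd

# Toy stand-ins for the two parsed team tables (illustrative data only).
team1 = pd.DataFrame({'Player': ['A', 'B'], 'Team': ['Spurs', 'Spurs'], 'PTS': [21, 17]})
team2 = pd.DataFrame({'Player': ['C', 'D'], 'Team': ['Lakers', 'Lakers'], 'PTS': [27, 20]})

# One concat over the list produces a single frame with a single header row.
target = pd.concat([team1, team2], ignore_index=True)

# index=False keeps the file to just the data columns.
target.to_csv('boxes.csv', index=False)
```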
Upvotes: 2
Reputation: 30022
pandas.concat() can concatenate a list of pandas objects with the same structure into one:
final = []
for url in urls:
    ...
    combined = pd.concat(both_teams)
    final.append(combined)

final_df = pd.concat(final, ignore_index=True)
print(final_df)
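The key difference from the question's code: a plain Python list's append() takes no ignore_index argument, and printing a list of DataFrames shows each frame's own header. Concatenating the collected list once at the end yields one frame with one header. A toy sketch (the data is illustrative):

```python
import pandas as pd

# Two toy per-page frames with identical columns, standing in for the
# per-URL results built inside the loop.
page1 = pd.DataFrame({'Player': ['A'], 'Team': ['X'], 'PTS': [10]})
page2 = pd.DataFrame({'Player': ['B'], 'Team': ['Y'], 'PTS': [12]})

final = [page1, page2]                          # collect frames in a plain list
final_df = pd.concat(final, ignore_index=True)  # single frame, single header

print(final_df)  # index runs 0..1 thanks to ignore_index=True
```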
Upvotes: 2