Reputation: 23
I'm pretty new to web scraping and am looking for guidance on an issue I've been trying to resolve for a few hours.
I'm trying to loop through a table-like structure (it's not an actual HTML table) and have used findAll to bring back every element with a certain tag and class.
The challenge is that every cell of the "table" has the same class name, "final-leaderboard__content", so I'm left with one huge flat list. I want to iterate through it and pull out the details so I can create a CSV/Excel file. This is the code so far:
from bs4 import BeautifulSoup
import requests
TournamentURL = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/"
TournamentResponse = requests.get(TournamentURL)
TournamentData = TournamentResponse.text
TournamentSoup = BeautifulSoup(TournamentData, 'html.parser')
RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
for RowContent in RowContents:
    print(RowContent.text.strip())
The result looks something like the list below. Without any explicit tag or id, I can't work out the best way to tell that items 0, 8, 16, etc. are the player name, items 1, 9, 17, etc. are the finish, and so on:
[0] - Name
[1] - Finish
[2] - R1
[3] - R2
[4] - R3
[5] - R4
[6] - Total
[7] - Par
[8] - Name (The second Name)
[9] - Finish (The second Finish)
etc
etc
I've tried slicing, modulo, and various other variants of the same idea, but can't seem to work it out.
Upvotes: 2
Views: 2063
Reputation: 4872
Another way is to create a dictionary, enumerate through your RowContents, and update the dictionary using the enumerated index modulo 8 (i % 8) as the key and the cell text as the value:
RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
d = {}
for i, RowContent in enumerate(RowContents):
    key = i % 8  # column position of this cell within its row
    d.setdefault(key, []).append(' '.join(RowContent.text.strip().split()))
>>> d
{
0: ['Name','Jamie ANDERSON Champion Golfer','Andrew KIRKALDY','Jamie ALLAN',....],
1: ['Finish','1','2','2','4',....],
2: ['R1','84','86','88','89','87',....],
.......
7: ['Par','M/C','M/C','M/C','M/C','M/C',.....]
}
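The modulo grouping can be tried offline on a plain list of strings before pointing it at the live page. This is only a minimal sketch: `group_by_column` and `cells` are made-up names, and the sample is a hypothetical 3-column cut of the leaderboard rather than the real 8 columns.

```python
def group_by_column(cells, column_count):
    """Group a flat, row-major list of cells into {column_index: [values]}."""
    columns = {}
    for i, cell in enumerate(cells):
        # i % column_count is the cell's position within its row
        columns.setdefault(i % column_count, []).append(cell)
    return columns


# Hypothetical sample: 3 columns instead of the page's 8
cells = [
    "Name", "Finish", "Total",
    "Jamie ANDERSON", "1", "169",
    "Andrew KIRKALDY", "2", "172",
]
columns = group_by_column(cells, 3)
print(columns[0])  # ['Name', 'Jamie ANDERSON', 'Andrew KIRKALDY']
```

Each list starts with the header cell, which is why the first row has to be promoted to column names afterwards.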
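If you can use pandas, build the DataFrame first, then promote the first row to column names and drop it (df has to exist before it can be referenced in rename and drop):
import pandas as pd
df = pd.DataFrame(d)
df = df.rename(columns=df.iloc[0]).drop(df.index[0])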
>>> print(df)
                             Name  Finish  R1  R2  R3  R4  Total  Par
1  Jamie ANDERSON Champion Golfer       1  84  85   -   -    169  M/C
2                 Andrew KIRKALDY       2  86  86   -   -    172  M/C
3                     Jamie ALLAN       2  88  84   -   -    172  M/C
4                   George PAXTON       4  89  85   -   -    174  M/C
5                        Tom KIDD       5  87  88   -   -    175  M/C
6                    Bob FERGUSON       6  89  87   -   -    176  M/C
7                   J.O.F. MORRIS       7  92  87   -   -    179  M/C
To save a DataFrame to CSV, use DataFrame.to_csv():
df.to_csv('yourfile.csv', index=False)
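If pandas isn't an option, the same column dictionary can be written out with the standard library's csv module. A rough sketch, assuming `d` has the shape shown above (each list starting with its header); `columns_to_csv` is a made-up helper name and the data is a shortened sample:

```python
import csv
import io


def columns_to_csv(columns, out):
    """Write {column_index: [header, value, ...]} to `out` as CSV rows."""
    ordered = [columns[key] for key in sorted(columns)]
    writer = csv.writer(out)
    # zip(*ordered) transposes the columns back into rows;
    # the first row written is the header row
    writer.writerows(zip(*ordered))


# Shortened, hypothetical sample of the dictionary built above
d = {
    0: ["Name", "Jamie ANDERSON", "Andrew KIRKALDY"],
    1: ["Finish", "1", "2"],
    2: ["Total", "169", "172"],
}
buffer = io.StringIO()
columns_to_csv(d, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass `open('yourfile.csv', 'w', newline='')` instead of the in-memory buffer.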
Upvotes: 1
Reputation: 20052
You can use the fact that this is indeed a kind of tabular data and grab all divs
that represent a row, split it by a number of columns, and there's your data:
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
page = requests.get(url).content
leaderboard = BeautifulSoup(page, "html.parser").find_all("div", {"class": "final-leaderboard__content"})
column_count = 8
split_by_columns = [
    leaderboard[i:i+column_count] for i in range(0, len(leaderboard), column_count)
]
table = [[i.getText(strip=True) for i in row] for row in split_by_columns]
print(tabulate(table[1:], headers=table[0]))
Output:
Name                             Finish    R1    R2  R3    R4      Total  Par
-----------------------------  --------  ----  ----  ----  ----  -------  -----
Jamie ANDERSONChampion Golfer         1    84    85  -     -         169  M/C
Andrew KIRKALDY                       2    86    86  -     -         172  M/C
Jamie ALLAN                           2    88    84  -     -         172  M/C
George PAXTON                         4    89    85  -     -         174  M/C
Tom KIDD                              5    87    88  -     -         175  M/C
Bob FERGUSON                          6    89    87  -     -         176  M/C
J.O.F. MORRIS                         7    92    87  -     -         179  M/C
Jack KIRKALDY                         8    92    89  -     -         181  M/C
James RENNIE                          8    93    88  -     -         181  M/C
Willie FERNIE                         8    92    89  -     -         181  M/C
David AYTON                          11    95    89  -     -         184  M/C
Henry LAMB                           11    91    93  -     -         184  M/C
Tom ARUNDEL                          11    95    89  -     -         184  M/C
Tom MORRIS SR                        14    92    93  -     -         185  M/C
William DOLEMAN                      14    91    94  -     -         185  M/C
Robert KINSMAN                       14    88    97  -     -         185  M/C
Bob MARTIN                           17    93    93  -     -         186  M/C
Ben SAYERS                           18    92    95  -     -         187  M/C
David ANDERSON SR                    19    94    94  -     -         188  M/C
David CORSTORPHINE                   20    93    96  -     -         189  M/C
Tom DUNN                             20    90    99  -     -         189  M/C
Peter PAXTON                         20    99    90  -     -         189  M/C
[A] SMITH                            20    94    95  -     -         189  M/C
D. GRANT                             20    95    94  -     -         189  M/C
Bob DOW                              20    95    94  -     -         189  M/C
Walter GOURLAY                       20    92    97  -     -         189  M/C
A.W. SMITH                           27    91    99  -     -         190  M/C
Douglas Argyll ROBERTSON             27    97    93  -     -         190  M/C
Robert ARMIT                         29    95    96  -     -         191  M/C
George STRATH                        29    97    94  -     -         191  M/C
J.H. BLACKWELL                       31    96    96  -     -         192  M/C
Tom MANZIE                           32    96    97  -     -         193  M/C
George LOWE                          33    94   100  -     -         194  M/C
G. HONEYMAN                          33    97    97  -     -         194  M/C
James FENTON                         35    99    97  -     -         196  M/C
Robert TAIT                          35    99    97  -     -         196  M/C
Bob KIRK                             37    99    98  -     -         197  M/C
Rev. D. LUNDIE                       37    98    99  -     -         197  M/C
Fitz BOOTHBY                         39    96   102  -     -         198  M/C
J. Thomson WHITE                     40   102    99  -     -         201  M/C
James KIRK                           41   105    97  -     -         202  M/C
W.H. GOFF                            42   105    99  -     -         204  M/C
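The split-by-column-count step silently produces a short last row if the number of scraped cells isn't a multiple of 8 (for example, if the page layout changes). A small sketch of the same slicing with a sanity check added; `split_rows` is a made-up helper name, and the sample uses 3 hypothetical columns instead of 8:

```python
def split_rows(cells, column_count):
    """Slice a flat, row-major list into rows of `column_count` cells."""
    if len(cells) % column_count:
        raise ValueError(
            f"{len(cells)} cells do not divide evenly into rows of {column_count}"
        )
    return [
        cells[i:i + column_count]
        for i in range(0, len(cells), column_count)
    ]


# Hypothetical sample: 3 columns instead of the page's 8
cells = [
    "Name", "Finish", "Total",
    "Jamie ANDERSON", "1", "169",
    "Andrew KIRKALDY", "2", "172",
]
header, *rows = split_rows(cells, 3)
print(header)  # ['Name', 'Finish', 'Total']
print(rows)
```

The check is a design choice rather than a requirement: failing loudly is usually easier to debug than a table whose last row is quietly truncated.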
Upvotes: 4
Reputation: 14233
import requests
from bs4 import BeautifulSoup
def parse_row(row):
    # Yield the cleaned text of each cell in a single leaderboard row
    for div in row.find_all("div", {"class": "final-leaderboard__content"}):
        yield div.text.strip().replace('\n', ' ')
url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find("div", {"class": "final-leaderboard__table"})
rows = table.find_all('div', {'class':"final-leaderboard__row"})
header = list(parse_row(rows[0]))
for row in rows[1:]:
    print(dict(zip(header, list(parse_row(row)))))
Output:
{'Name': 'Jamie ANDERSON Champion Golfer', 'Finish': '1', 'R1': '84', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '169', 'Par': 'M/C'}
{'Name': 'Andrew KIRKALDY', 'Finish': '2', 'R1': '86', 'R2': '86', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'Jamie ALLAN', 'Finish': '2', 'R1': '88', 'R2': '84', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'George PAXTON', 'Finish': '4', 'R1': '89', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '174', 'Par': 'M/C'}
{'Name': 'Tom KIDD', 'Finish': '5', 'R1': '87', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '175', 'Par': 'M/C'}
{'Name': 'Bob FERGUSON', 'Finish': '6', 'R1': '89', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '176', 'Par': 'M/C'}
{'Name': 'J.O.F. MORRIS', 'Finish': '7', 'R1': '92', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '179', 'Par': 'M/C'}
{'Name': 'Jack KIRKALDY', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'James RENNIE', 'Finish': '8', 'R1': '93', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'Willie FERNIE', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'David AYTON', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Henry LAMB', 'Finish': '11', 'R1': '91', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom ARUNDEL', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom MORRIS SR', 'Finish': '14', 'R1': '92', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'William DOLEMAN', 'Finish': '14', 'R1': '91', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Robert KINSMAN', 'Finish': '14', 'R1': '88', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Bob MARTIN', 'Finish': '17', 'R1': '93', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '186', 'Par': 'M/C'}
{'Name': 'Ben SAYERS', 'Finish': '18', 'R1': '92', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '187', 'Par': 'M/C'}
{'Name': 'David ANDERSON SR', 'Finish': '19', 'R1': '94', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '188', 'Par': 'M/C'}
{'Name': 'David CORSTORPHINE', 'Finish': '20', 'R1': '93', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Tom DUNN', 'Finish': '20', 'R1': '90', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Peter PAXTON', 'Finish': '20', 'R1': '99', 'R2': '90', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': '[A] SMITH', 'Finish': '20', 'R1': '94', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'D. GRANT', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Bob DOW', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Walter GOURLAY', 'Finish': '20', 'R1': '92', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'A.W. SMITH', 'Finish': '27', 'R1': '91', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Douglas Argyll ROBERTSON', 'Finish': '27', 'R1': '97', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Robert ARMIT', 'Finish': '29', 'R1': '95', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'George STRATH', 'Finish': '29', 'R1': '97', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'J.H. BLACKWELL', 'Finish': '31', 'R1': '96', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '192', 'Par': 'M/C'}
{'Name': 'Tom MANZIE', 'Finish': '32', 'R1': '96', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '193', 'Par': 'M/C'}
{'Name': 'George LOWE', 'Finish': '33', 'R1': '94', 'R2': '100', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'G. HONEYMAN', 'Finish': '33', 'R1': '97', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'James FENTON', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Robert TAIT', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Bob KIRK', 'Finish': '37', 'R1': '99', 'R2': '98', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Rev. D. LUNDIE', 'Finish': '37', 'R1': '98', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Fitz BOOTHBY', 'Finish': '39', 'R1': '96', 'R2': '102', 'R3': '-', 'R4': '-', 'Total': '198', 'Par': 'M/C'}
{'Name': 'J. Thomson WHITE', 'Finish': '40', 'R1': '102', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '201', 'Par': 'M/C'}
{'Name': 'James KIRK', 'Finish': '41', 'R1': '105', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '202', 'Par': 'M/C'}
{'Name': 'W.H. GOFF', 'Finish': '42', 'R1': '105', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '204', 'Par': 'M/C'}
Of course, instead of a dict you could use another data structure, such as a namedtuple.
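For instance, a namedtuple version might look like the sketch below. It assumes the header texts are valid Python identifiers (true for this leaderboard's headers; otherwise `namedtuple(..., rename=True)` or a manual mapping would be needed). `Result` is a made-up type name, and the values are copied from the first output row above.

```python
from collections import namedtuple

# Header cells as scraped from the first row of the leaderboard
header = ["Name", "Finish", "R1", "R2", "R3", "R4", "Total", "Par"]

# Hypothetical record type; field names come straight from the header
Result = namedtuple("Result", header)

values = ["Jamie ANDERSON Champion Golfer", "1", "84", "85", "-", "-", "169", "M/C"]
result = Result(*values)

print(result.Name, result.Total)  # Jamie ANDERSON Champion Golfer 169
print(result._asdict()["Finish"])  # 1
```

In the loop from the answer this would be `Result(*parse_row(row))` in place of the `dict(zip(...))` call.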
Upvotes: 3