mrroweuk

Reputation: 23

Python - BeautifulSoup - Iterating through findall by specific elements in list

I'm pretty new to web scraping, so I'm looking for some guidance on an issue I've been trying to resolve for a few hours.

I'm trying to loop through a table-like structure (it's not an actual table) and have used findAll to bring back all the elements with a certain tag and class.

The challenge is that every cell of the "table" has the same class name, "final-leaderboard__content", so I'm left with one huge flat list. I want to iterate through it and retrieve the details so I can create a CSV/Excel file from them. This is the code so far:


from bs4 import BeautifulSoup
import requests

TournamentURL = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/"
TournamentResponse = requests.get(TournamentURL)
TournamentData = TournamentResponse.text
TournamentSoup = BeautifulSoup(TournamentData, 'html.parser')

RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
for RowContent in RowContents:
    # each item is one cell of the leaderboard, in page order
    print(RowContent.get_text(strip=True))

The result looks something like the list below. Without any explicit tag or id, I can't work out the best way to know that items 0, 8, 16, etc. are the Player Name, items 1, 9, 17, etc. are the Finish, and so on.

[0] - Name
[1] - Finish
[2] - R1
[3] - R2
[4] - R3
[5] - R4
[6] - Total
[7] - Par
[8] - Name (The second Name)
[9] - Finish (The second Finish) 
etc
etc

I've tried slicing, modulo and various variants of the same, but can't seem to work it out.

Upvotes: 2

Views: 2063

Answers (3)

Shijith

Reputation: 4872

Another way is to create a dictionary: enumerate through your RowContents and, for each item, use the enumerated index modulo 8 (i % 8) as the key and append the item's text to that key's list.

RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
d={}
for i, RowContent in enumerate(RowContents):
    key = (i)%8
    d.setdefault(key, []).append(' '.join(RowContent.text.strip().split()))

>>> d
{
   0: ['Name','Jamie ANDERSON Champion Golfer','Andrew KIRKALDY','Jamie ALLAN',....]
   1: ['Finish','1','2','2','4',....]
   2: ['R1','84','86','88','89','87',....]
   .......
   7: ['Par','M/C','M/C','M/C','M/C','M/C',.....]
}

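If you can use pandas, build a DataFrame from the dictionary and promote the first row to the column header:

import pandas as pd

df = pd.DataFrame(d)
# the first row holds the column names, so use it as the header and then drop it
df = df.rename(columns=df.iloc[0]).drop(df.index[0])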
>>> print(df)  

                              Name Finish   R1   R2 R3 R4 Total  Par
1   Jamie ANDERSON Champion Golfer      1   84   85  -  -   169  M/C
2                  Andrew KIRKALDY      2   86   86  -  -   172  M/C
3                      Jamie ALLAN      2   88   84  -  -   172  M/C
4                    George PAXTON      4   89   85  -  -   174  M/C
5                         Tom KIDD      5   87   88  -  -   175  M/C
6                     Bob FERGUSON      6   89   87  -  -   176  M/C
7                    J.O.F. MORRIS      7   92   87  -  -   179  M/C

To save the DataFrame to CSV, use DataFrame.to_csv():

df.to_csv('yourfile.csv', index=False)
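
If pandas is not an option, the same dictionary can probably be written out with the standard-library csv module. This is only a minimal sketch: the d below is a hypothetical, shortened stand-in for the real dictionary (three columns instead of eight), and it writes to an in-memory buffer rather than a file so it runs on its own.

```python
import csv
import io

# Hypothetical, shortened stand-in for the dictionary built above
# (the real one has keys 0..7, one per leaderboard column).
d = {
    0: ["Name", "Jamie ANDERSON", "Andrew KIRKALDY"],
    1: ["Finish", "1", "2"],
    2: ["R1", "84", "86"],
}

# zip(*columns) turns the per-column lists back into per-row tuples.
rows = list(zip(*(d[key] for key in sorted(d))))

# Write to an in-memory buffer; for a real file, use
# open("yourfile.csv", "w", newline="") instead of io.StringIO().
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The first tuple in rows is the header row, so no separate header handling is needed here.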

Upvotes: 1

baduker

Reputation: 20052

You can use the fact that this is effectively tabular data: grab all the divs that hold the cells, split that flat list into rows by the number of columns, and there's your data:

import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
page = requests.get(url).content

leaderboard = BeautifulSoup(page, "html.parser").find_all("div", {"class": "final-leaderboard__content"})

column_count = 8
split_by_columns = [
    leaderboard[i:i+column_count] for i in range(0, len(leaderboard), column_count)
]

table = [[i.getText(strip=True) for i in row] for row in split_by_columns]
print(tabulate(table[1:], headers=table[0]))

Output:

Name                             Finish    R1    R2  R3    R4      Total  Par
-----------------------------  --------  ----  ----  ----  ----  -------  -----
Jamie ANDERSONChampion Golfer         1    84    85  -     -         169  M/C
Andrew KIRKALDY                       2    86    86  -     -         172  M/C
Jamie ALLAN                           2    88    84  -     -         172  M/C
George  PAXTON                        4    89    85  -     -         174  M/C
Tom KIDD                              5    87    88  -     -         175  M/C
Bob FERGUSON                          6    89    87  -     -         176  M/C
J.O.F. MORRIS                         7    92    87  -     -         179  M/C
Jack KIRKALDY                         8    92    89  -     -         181  M/C
James RENNIE                          8    93    88  -     -         181  M/C
Willie FERNIE                         8    92    89  -     -         181  M/C
David AYTON                          11    95    89  -     -         184  M/C
Henry LAMB                           11    91    93  -     -         184  M/C
Tom ARUNDEL                          11    95    89  -     -         184  M/C
Tom MORRIS SR                        14    92    93  -     -         185  M/C
William DOLEMAN                      14    91    94  -     -         185  M/C
Robert KINSMAN                       14    88    97  -     -         185  M/C
Bob MARTIN                           17    93    93  -     -         186  M/C
Ben SAYERS                           18    92    95  -     -         187  M/C
David ANDERSON SR                    19    94    94  -     -         188  M/C
David CORSTORPHINE                   20    93    96  -     -         189  M/C
Tom DUNN                             20    90    99  -     -         189  M/C
Peter PAXTON                         20    99    90  -     -         189  M/C
[A] SMITH                            20    94    95  -     -         189  M/C
D. GRANT                             20    95    94  -     -         189  M/C
Bob DOW                              20    95    94  -     -         189  M/C
Walter GOURLAY                       20    92    97  -     -         189  M/C
A.W. SMITH                           27    91    99  -     -         190  M/C
Douglas Argyll ROBERTSON             27    97    93  -     -         190  M/C
Robert ARMIT                         29    95    96  -     -         191  M/C
George  STRATH                       29    97    94  -     -         191  M/C
J.H. BLACKWELL                       31    96    96  -     -         192  M/C
Tom MANZIE                           32    96    97  -     -         193  M/C
George LOWE                          33    94   100  -     -         194  M/C
G. HONEYMAN                          33    97    97  -     -         194  M/C
James FENTON                         35    99    97  -     -         196  M/C
Robert TAIT                          35    99    97  -     -         196  M/C
Bob KIRK                             37    99    98  -     -         197  M/C
Rev. D. LUNDIE                       37    98    99  -     -         197  M/C
Fitz BOOTHBY                         39    96   102  -     -         198  M/C
J. Thomson WHITE                     40   102    99  -     -         201  M/C
James KIRK                           41   105    97  -     -         202  M/C
W.H. GOFF                            42   105    99  -     -         204  M/C
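
Note that stripping each text node glues "Jamie ANDERSON" and "Champion Golfer" together in the output above. A rough sketch of the same chunking idea on plain strings, with whitespace normalised and a guard against an incomplete last row, might look like this. The helper names chunk_rows and clean are made up for illustration, and the sample cells are a shortened, two-column stand-in for the real leaderboard:

```python
def clean(text):
    """Collapse any run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def chunk_rows(cells, column_count):
    """Split a flat list of cells into rows of column_count items each."""
    if len(cells) % column_count:
        raise ValueError(
            f"{len(cells)} cells cannot be split into rows of {column_count}"
        )
    return [
        cells[i:i + column_count]
        for i in range(0, len(cells), column_count)
    ]


# Hypothetical sample: two columns instead of the leaderboard's eight.
cells = [
    "Name", "Finish",
    "Jamie ANDERSON\n     Champion Golfer", "1",
    "Andrew KIRKALDY", "2",
]

rows = chunk_rows([clean(cell) for cell in cells], column_count=2)
for row in rows:
    print(row)
```

The length check is a design choice: if the page layout changes and the cell count stops being a multiple of the column count, failing loudly is usually better than silently producing shifted rows.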

Upvotes: 4

buran

Reputation: 14233

import requests
from bs4 import BeautifulSoup

def parse_row(row):
    for div in row.find_all("div", {"class": "final-leaderboard__content"}):
        yield div.text.strip().replace('\n', ' ')


url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find("div", {"class": "final-leaderboard__table"})
rows = table.find_all('div', {'class':"final-leaderboard__row"})
header = list(parse_row(rows[0]))
for row in rows[1:]:
    print(dict(zip(header, list(parse_row(row)))))

Output:

{'Name': 'Jamie ANDERSON     Champion Golfer', 'Finish': '1', 'R1': '84', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '169', 'Par': 'M/C'}
{'Name': 'Andrew KIRKALDY', 'Finish': '2', 'R1': '86', 'R2': '86', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'Jamie ALLAN', 'Finish': '2', 'R1': '88', 'R2': '84', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'George  PAXTON', 'Finish': '4', 'R1': '89', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '174', 'Par': 'M/C'}
{'Name': 'Tom KIDD', 'Finish': '5', 'R1': '87', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '175', 'Par': 'M/C'}
{'Name': 'Bob FERGUSON', 'Finish': '6', 'R1': '89', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '176', 'Par': 'M/C'}
{'Name': 'J.O.F. MORRIS', 'Finish': '7', 'R1': '92', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '179', 'Par': 'M/C'}
{'Name': 'Jack KIRKALDY', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'James RENNIE', 'Finish': '8', 'R1': '93', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'Willie FERNIE', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'David AYTON', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Henry LAMB', 'Finish': '11', 'R1': '91', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom ARUNDEL', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom MORRIS SR', 'Finish': '14', 'R1': '92', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'William DOLEMAN', 'Finish': '14', 'R1': '91', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Robert KINSMAN', 'Finish': '14', 'R1': '88', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Bob MARTIN', 'Finish': '17', 'R1': '93', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '186', 'Par': 'M/C'}
{'Name': 'Ben SAYERS', 'Finish': '18', 'R1': '92', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '187', 'Par': 'M/C'}
{'Name': 'David ANDERSON SR', 'Finish': '19', 'R1': '94', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '188', 'Par': 'M/C'}
{'Name': 'David CORSTORPHINE', 'Finish': '20', 'R1': '93', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Tom DUNN', 'Finish': '20', 'R1': '90', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Peter PAXTON', 'Finish': '20', 'R1': '99', 'R2': '90', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': '[A] SMITH', 'Finish': '20', 'R1': '94', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'D. GRANT', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Bob DOW', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Walter GOURLAY', 'Finish': '20', 'R1': '92', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'A.W. SMITH', 'Finish': '27', 'R1': '91', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Douglas Argyll ROBERTSON', 'Finish': '27', 'R1': '97', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Robert ARMIT', 'Finish': '29', 'R1': '95', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'George  STRATH', 'Finish': '29', 'R1': '97', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'J.H. BLACKWELL', 'Finish': '31', 'R1': '96', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '192', 'Par': 'M/C'}
{'Name': 'Tom MANZIE', 'Finish': '32', 'R1': '96', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '193', 'Par': 'M/C'}
{'Name': 'George LOWE', 'Finish': '33', 'R1': '94', 'R2': '100', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'G. HONEYMAN', 'Finish': '33', 'R1': '97', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'James FENTON', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Robert TAIT', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Bob KIRK', 'Finish': '37', 'R1': '99', 'R2': '98', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Rev. D. LUNDIE', 'Finish': '37', 'R1': '98', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Fitz BOOTHBY', 'Finish': '39', 'R1': '96', 'R2': '102', 'R3': '-', 'R4': '-', 'Total': '198', 'Par': 'M/C'}
{'Name': 'J. Thomson WHITE', 'Finish': '40', 'R1': '102', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '201', 'Par': 'M/C'}
{'Name': 'James KIRK', 'Finish': '41', 'R1': '105', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '202', 'Par': 'M/C'}
{'Name': 'W.H. GOFF', 'Finish': '42', 'R1': '105', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '204', 'Par': 'M/C'}

Of course, instead of a dict you may use another data structure, such as a namedtuple.
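
As a minimal sketch of the namedtuple variant: the header and row below are copied from the output above rather than scraped, and the class name Player is just an illustrative choice.

```python
from collections import namedtuple

# Header and one data row, copied from the output above for illustration.
header = ["Name", "Finish", "R1", "R2", "R3", "R4", "Total", "Par"]
row = ["Andrew KIRKALDY", "2", "86", "86", "-", "-", "172", "M/C"]

# Every header here is a valid Python identifier, so it can be used
# directly as a field name; "Player" is a made-up class name.
Player = namedtuple("Player", header)

player = Player(*row)
print(player.Name, player.Total)
print(player._asdict())
```

This only works as written because the column names are valid identifiers. For headers that are not, namedtuple accepts rename=True, which replaces invalid names with positional ones such as _0.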

Upvotes: 3
