Reputation: 71
I have been following FC Python's tutorial on web scraping, and I do not understand how they identified the values in range(1,41,2)
as the link locations for this page. Is this something I should be able to see in the page source?
import requests
from bs4 import BeautifulSoup

# headers is not defined in the snippet as posted; a User-Agent header is
# assumed here, since Transfermarkt rejects requests without one
headers = {'User-Agent': 'Mozilla/5.0'}

#Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers=headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that the link is pointing to, so for each link, take the link location.
#Additionally, we only need the links in locations 1, 3, 5, etc. of our list, so loop through those only
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))

#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
Upvotes: 1
Views: 129
Reputation: 12255
On the website, each row in the table contains three a.vereinprofil_tooltip
links, and they are identical, with the same href. To avoid duplicates, they use links 1, 3, 5, etc. And yes, you can see this in the page source, and also in Chrome Dev Tools.
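A minimal sketch to see the duplication for yourself (it reuses the page and selector from the question; the User-Agent header is an assumption, since Transfermarkt blocks requests without one):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; any browser-like value works
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
soup = BeautifulSoup(requests.get(page, headers=headers).content, 'html.parser')

# print the first six matched hrefs; runs of identical consecutive values
# are the duplicate links within a single table row
for link in soup.select("a.vereinprofil_tooltip")[:6]:
    print(link.get("href"))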
You can also collect the links in other ways, e.g. with a narrower CSS selector for the CLUBS - PREMIER LEAGUE 19/20 table:

#yw1 .zentriert a.vereinprofil_tooltip

or by de-duplicating all matches while keeping their order:

team_links = list(dict.fromkeys([f"https://www.transfermarkt.co.uk{x['href']}" for x in soup.select("a.vereinprofil_tooltip")]))
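A sketch of the narrower selector in use, assuming (as the answer implies) that #yw1 is the id of the league table on that page and that the centred cells hold each club link once:

# restrict the search to the league table so each club link appears once
team_links = [
    f"https://www.transfermarkt.co.uk{x['href']}"
    for x in soup.select("#yw1 .zentriert a.vereinprofil_tooltip")
]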
Upvotes: 2
Reputation: 924
The range(1,41,2)
is used to avoid duplicated links. That's because in each row of the table there are multiple cells that contain the same link.
We can obtain the same result by getting all the links and removing the duplicates with a set:
teamLinks = list({x.get("href") for x in links})
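For completeness, here is that combined with the prefixing step from the question; one caveat is that a set does not preserve insertion order, so the teams may come out in a different order than they appear in the table:

# de-duplicate the hrefs with a set comprehension, then add the domain;
# set iteration order is arbitrary, so the team order is not guaranteed
teamLinks = ["https://www.transfermarkt.co.uk" + href
             for href in {x.get("href") for x in links}]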
Upvotes: 2
Reputation: 10194
It is an attempt to remove duplicate entries.
A more robust way to achieve the same result is this:
# iterate over all links in the list
for i in range(len(links)):
    teamLinks.append(links[i].get("href"))

for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk" + teamLinks[i]

# make a set to remove duplicates, then turn it back into a list
teamLinks = list(set(teamLinks))
teamLinks
then prints out something like this:
['https://www.transfermarkt.co.uk/crystal-palace/spielplan/verein/873/saison_id/2019',
'https://www.transfermarkt.co.uk/afc-bournemouth/spielplan/verein/989/saison_id/2019',
'https://www.transfermarkt.co.uk/sheffield-united/spielplan/verein/350/saison_id/2019',
...
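One note on this approach (a general Python fact, not from the original answer): list(set(...)) does not preserve the original order, which is why the output above is not in table order. If the order matters, an order-preserving alternative is:

# dict.fromkeys removes duplicates while keeping insertion order (Python 3.7+)
teamLinks = list(dict.fromkeys(teamLinks))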
Upvotes: 1