Snooley

Reputation: 71

Scraping Link Location

I have been following FC Python's tutorial on web scraping, and I do not understand how they arrived at range(1,41,2) as the link locations for this page. Is this something I should be able to see in the page source?

#Process League Table
import requests
from bs4 import BeautifulSoup

#Transfermarkt tends to reject the default requests user agent, so send a browser-like one
headers = {'User-Agent': 'Mozilla/5.0'}

page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers=headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that the link is pointing to, so for each link, take the link location. 
#Additionally, we only need the links in locations 1,3,5,etc. of our list, so loop through those only
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))

#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]

Upvotes: 1

Views: 129

Answers (3)

Sers

Reputation: 12255

On the website, each row of the table contains three a.vereinprofil_tooltip links, all with the same href. To avoid duplicates, the tutorial takes only the links at positions 1, 3, 5, etc. And yes, you should be able to see this in the page source, and also in Chrome Dev Tools.
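If you want to confirm this from Python rather than the page source, a quick check (reusing the soup object from the question) is:

links = soup.select("a.vereinprofil_tooltip")
print(len(links))            # several matches per club row, so well over 20
print(links[0].get("href"))  # indices 0 and 1 point at the same club
print(links[1].get("href"))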

You can also collect the links in other ways (a sketch of both follows below):

  • Use a different selector, such as #yw1 .zentriert a.vereinprofil_tooltip, which targets the CLUBS - PREMIER LEAGUE 19/20 table directly
  • Use Python code to remove duplicates while preserving order:
    team_links = list(dict.fromkeys([f"https://www.transfermarkt.co.uk{x['href']}" for x in soup.select("a.vereinprofil_tooltip")]))
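
A minimal sketch of both options; the headers value is an assumed browser-like user agent, and the #yw1 id and .zentriert class are taken from the page as it looks today:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
soup = BeautifulSoup(requests.get(page, headers=headers).content, 'html.parser')

# Option 1: a narrower selector that matches each club link only once
links = soup.select("#yw1 .zentriert a.vereinprofil_tooltip")
team_links = [f"https://www.transfermarkt.co.uk{a['href']}" for a in links]

# Option 2: select everything, then de-duplicate while preserving order
all_links = soup.select("a.vereinprofil_tooltip")
team_links = list(dict.fromkeys(
    f"https://www.transfermarkt.co.uk{a['href']}" for a in all_links
))

print(len(team_links))  # should be 20 for a Premier League season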

Upvotes: 2

Roomm

Reputation: 924

The range(1,41,2) is used to avoid duplicated links. That's because in the table, each row has multiple cells that contain the same link.


We can obtain the same result by taking all the links and removing duplicates with a set (note that a set does not preserve the original order):

teamLinks = list({x.get("href") for x in links})
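
The hrefs are relative, so the prefixing loop from the question still applies afterwards:

for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk" + teamLinks[i]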

Upvotes: 2

petezurich

Reputation: 10194

It is an attempt to remove duplicate entries.

A more robust way to achieve the same is this:

# iterate over all links in list
for i in range(len(links)):
    teamLinks.append(links[i].get("href"))

for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]

# make a set to remove duplicates and then make a list of it again
teamLinks = list(set(teamLinks))

teamLinks then prints out to something like this:

['https://www.transfermarkt.co.uk/crystal-palace/spielplan/verein/873/saison_id/2019',
 'https://www.transfermarkt.co.uk/afc-bournemouth/spielplan/verein/989/saison_id/2019',
 'https://www.transfermarkt.co.uk/sheffield-united/spielplan/verein/350/saison_id/2019',
...

Upvotes: 1
