viktor

Reputation: 185

Finding location of class for BeautifulSoup

I'm struggling with BeautifulSoup. I want to scrape the links to the contests from the table on the right side of [Transfermarkt][1].

This is how I locate it so far:

div1 = soup.find('div', {'class': 'large-4 columns'})
div2 = div1.find('div', {'class': 'box'})
table = div2.find('table')
table_body = table.find('tbody')
contest = table_body.find_all('a')

The problem is that this is not specific enough: I sometimes get duplicate values, which completely break my structure.

Is there a better way to locate this exact position?

The position I need: the "title" of the "a" inside the "td" with class="no-border-links"

Upvotes: 0

Views: 67

Answers (3)

SIM

Reputation: 22440

Try the following to get the desired content:

import re
import requests
from bs4 import BeautifulSoup

URL = "https://www.transfermarkt.de/jumplist/erfolge/spieler/17259"

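res = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")
# Rows of the table that directly follows the "Alle Titel" header.
# Newer soupsieve releases spell this pseudo-class :-soup-contains().
for row in soup.select(".table-header:contains('Alle Titel') + table tr"):
    # First link in the row whose text contains a word character
    title = row.find("a", string=re.compile(r"\w"))
    if not title:
        continue
    print(title.text)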

To get the link as well, try the following:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

URL = "https://www.transfermarkt.de/jumplist/erfolge/spieler/17259"

res = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")
for row in soup.select(".table-header:contains('Alle Titel') + table tr"):
    # First link in the row whose text contains a word character
    title = row.find("a", string=re.compile(r"\w"))
    if not title:
        continue
    # Resolve the first site-relative href against the page URL
    anchor = row.select_one("a[href^='/']")
    link = urljoin(URL, anchor["href"]) if anchor else ""
    print(title.text, link)
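
As a rough offline illustration of the same idea (anchor on the box header, then read the table that follows it), the sketch below swaps the `:contains` selector for a plain Python text check plus `find_next_sibling`. This is a substitute technique, not what the code above does, and `SAMPLE_HTML` is made-up markup that only assumes the header/table layout implied by the selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two boxes, only the first has the header we want.
SAMPLE_HTML = """
<div class="box">
  <div class="table-header">Alle Titel</div>
  <table>
    <tr><td><a href="/wm-2014">Weltmeisterschaft 2014</a></td></tr>
    <tr><td><a href="/cl">UEFA Champions League</a></td></tr>
  </table>
</div>
<div class="box">
  <div class="table-header">Other box</div>
  <table>
    <tr><td><a href="/x">Not wanted</a></td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Pick the header by its text instead of relying on a text pseudo-class.
header = next(
    (h for h in soup.select(".table-header") if "Alle Titel" in h.get_text()),
    None,
)

# find_next_sibling returns the first later sibling <table>; unlike the CSS
# "+" combinator it does not require the table to be immediately adjacent.
table = header.find_next_sibling("table") if header else None

titles = [a.get_text(strip=True) for a in table.select("tr a")] if table else []
print(titles)
```

Because the lookup starts from the header, links in other boxes on the page are never visited, which is one way to avoid picking up unrelated rows.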

Upvotes: 1

Selçuk

Reputation: 1123

It is better to use select() in this case.

for title in soup.select('.large-4.columns td.no-border-links > a'):
    if title.text:
        print(title.text)

The output will be:

Weltmeisterschaft 2014
UEFA Champions League
1.Bundesliga
1.Bundesliga
1.Bundesliga
1.Bundesliga
FC Bayern München
1.Bundesliga
UEFA Champions League
1.Bundesliga
1.Bundesliga
1.Bundesliga
Deutschland
Deutschland
Weltmeisterschaft 2018
Weltmeisterschaft 2014
Weltmeisterschaft 2010
Europameisterschaft 2016
Europameisterschaft 2012
Weltmeisterschaft 2014
U21-Europameisterschaft 2009
UEFA Champions League
1.Bundesliga
Weltmeisterschaft 2010
Deutschland
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Super Cup
FC Bayern München
FC Bayern München
FC Bayern München
Deutschland
FIFA Klub-WM
DFB-Pokal
DFB-Pokal
DFB-Pokal
DFB-Pokal
DFL-Supercup
DFL-Supercup
DFL-Supercup
DFB-SuperCup
DFB-Pokal
U21-Europameisterschaft 2009
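
The question asks for the links rather than the titles, and the list above repeats many contests. If those repeats are unwanted, one possible approach is to keep the first link seen for each title. The sketch below runs on made-up markup; `SAMPLE_HTML`, `BASE_URL`, and the helper `contest_links` are illustrative names, not part of the page or of BeautifulSoup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.transfermarkt.de"  # only used to resolve relative hrefs

# Hypothetical markup: an icon link without text and a repeated contest.
SAMPLE_HTML = """
<div class="large-4 columns">
  <table>
    <tr><td class="no-border-links"><a href="/cl"><img alt=""></a><a href="/cl">UEFA Champions League</a></td></tr>
    <tr><td class="no-border-links"><a href="/cl">UEFA Champions League</a></td></tr>
    <tr><td class="no-border-links"><a href="/dfb-pokal">DFB-Pokal</a></td></tr>
  </table>
</div>
"""


def contest_links(html, base_url):
    """Return (title, absolute_url) pairs, one per distinct title, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = {}  # dicts preserve insertion order, so page order is kept
    for a in soup.select(".large-4.columns td.no-border-links > a[href]"):
        title = a.get_text(strip=True)
        if not title:
            continue  # skip image-only links
        # setdefault keeps the first URL seen for a given title
        seen.setdefault(title, urljoin(base_url, a["href"]))
    return list(seen.items())


for title, url in contest_links(SAMPLE_HTML, BASE_URL):
    print(title, url)
```

Whether deduplicating by title is right depends on the goal: if every row (for example, one per season) matters, drop the dictionary and collect all pairs instead.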

Upvotes: 2

Pranav G.

Reputation: 71

Try the select() method in BeautifulSoup, which accepts CSS selectors.

In your case, you could use something like:

a_tags = soup.select("td[class='no-border-links'] > a")

You can then iterate over the result and read each title from the tag's text attribute.
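
A minimal sketch of that iteration, run against made-up markup (`SAMPLE_HTML` is an assumption, not the real page). It also compares the exact attribute match used above with the class selector `td.no-border-links`, which still matches when the cell carries additional classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modelled on the page in the question.
SAMPLE_HTML = """
<table>
  <tr><td class="no-border-links"><a href="/wm-2014">Weltmeisterschaft 2014</a></td></tr>
  <tr><td class="no-border-links"><a href="/cl"><img alt=""></a></td></tr>
  <tr><td class="no-border-links hauptlink"><a href="/dfb-pokal">DFB-Pokal</a></td></tr>
  <tr><td class="zentriert"><a href="/other">Other</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact attribute match: the class attribute must be exactly "no-border-links".
exact = soup.select("td[class='no-border-links'] > a")
# Class selector: matches any cell that has the class, whatever else it carries.
by_class = soup.select("td.no-border-links > a")

# Read the text of each tag and drop links with no text (e.g. image-only links).
exact_titles = [a.get_text(strip=True) for a in exact if a.get_text(strip=True)]
class_titles = [a.get_text(strip=True) for a in by_class if a.get_text(strip=True)]

print(exact_titles)
print(class_titles)
```

With this sample, the exact match misses the cell whose class attribute is "no-border-links hauptlink", so the class selector is usually the safer choice when the markup may vary.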

Upvotes: 0
