Reputation: 185
I'm struggling with BeautifulSoup. I want to scrape the links to the contests in the table on the right-hand side of [Transfermarkt][1].
This is how I locate the table so far:
div1 = soup.find('div', {'class': 'large-4 columns'})
div2 = div1.find('div', {'class': 'box'})
table = div2.find('table')
table_body = table.find('tbody')
contest = table_body.find_all('a')
The problem is that this is not specific enough: I sometimes get duplicate values, which completely break my structure.
Is there a better way to target this exact position?
The elements I need are the a tags (the ones carrying a title attribute) inside each td with class="no-border-links".
Upvotes: 0
Views: 67
Reputation: 22440
Try the following to get the desired content:
import re
import requests
from bs4 import BeautifulSoup

URL = "https://www.transfermarkt.de/jumplist/erfolge/spieler/17259"

res = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

# ":-soup-contains" is the current spelling; older soupsieve versions only know ":contains".
for items in soup.select(".table-header:-soup-contains('Alle Titel') + table tr"):
    # Keep only rows that have a link with visible text (skips image-only links).
    anchor = items.find("a", string=re.compile(r"\w"))
    if not anchor:
        continue
    print(anchor.text)
To get the link as well, try below:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

URL = "https://www.transfermarkt.de/jumplist/erfolge/spieler/17259"

res = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

# ":-soup-contains" is the current spelling; older soupsieve versions only know ":contains".
for items in soup.select(".table-header:-soup-contains('Alle Titel') + table tr"):
    # Keep only rows that have a link with visible text (skips image-only links).
    anchor = items.find("a", string=re.compile(r"\w"))
    if not anchor:
        continue
    item = anchor.text
    # Resolve the first site-relative href in the row against the page URL.
    link_tag = items.select_one("a[href^='/']")
    link = urljoin(URL, link_tag.get("href")) if link_tag else ""
    print(item, link)
Upvotes: 1
Reputation: 1123
It is better to use select() for this case.
for title in soup.select('.large-4.columns td.no-border-links > a'):
    # Skip anchors without visible text (e.g. links that only wrap an image).
    if title.text:
        print(title.text)
The output will be:
Weltmeisterschaft 2014
UEFA Champions League
1.Bundesliga
1.Bundesliga
1.Bundesliga
1.Bundesliga
FC Bayern München
1.Bundesliga
UEFA Champions League
1.Bundesliga
1.Bundesliga
1.Bundesliga
Deutschland
Deutschland
Weltmeisterschaft 2018
Weltmeisterschaft 2014
Weltmeisterschaft 2010
Europameisterschaft 2016
Europameisterschaft 2012
Weltmeisterschaft 2014
U21-Europameisterschaft 2009
UEFA Champions League
1.Bundesliga
Weltmeisterschaft 2010
Deutschland
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Champions League
UEFA Super Cup
FC Bayern München
FC Bayern München
FC Bayern München
Deutschland
FIFA Klub-WM
DFB-Pokal
DFB-Pokal
DFB-Pokal
DFB-Pokal
DFL-Supercup
DFL-Supercup
DFL-Supercup
DFB-SuperCup
DFB-Pokal
U21-Europameisterschaft 2009
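Since the question asks for the links and complains about duplicates, the same selector can be extended to return each contest once, together with its absolute URL. The following is only a sketch: SAMPLE_HTML, its href values, and the helper name contest_links are made up for illustration and stand in for the real page, which would be fetched with requests as in the other answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.transfermarkt.de"

# Hypothetical, simplified markup; the real page and its hrefs will differ.
SAMPLE_HTML = """
<div class="large-4 columns">
  <div class="box">
    <table>
      <tbody>
        <tr><td class="no-border-links">
          <a href="/contest/wm-2014"><img alt="logo"/></a>
          <a title="Weltmeisterschaft 2014" href="/contest/wm-2014">Weltmeisterschaft 2014</a>
        </td></tr>
        <tr><td class="no-border-links">
          <a title="UEFA Champions League" href="/contest/cl">UEFA Champions League</a>
        </td></tr>
        <tr><td class="no-border-links">
          <a title="Weltmeisterschaft 2014" href="/contest/wm-2014">Weltmeisterschaft 2014</a>
        </td></tr>
      </tbody>
    </table>
  </div>
</div>
"""


def contest_links(html, base_url):
    """Return unique (title, absolute_url) pairs in document order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = {}  # dict keys keep insertion order, so this de-duplicates stably
    for a in soup.select(".large-4.columns td.no-border-links > a"):
        text = a.get_text(strip=True)
        href = a.get("href")
        if not text or not href:
            continue  # image-only links have no text
        seen.setdefault((text, urljoin(base_url, href)), None)
    return list(seen)


for title, link in contest_links(SAMPLE_HTML, BASE_URL):
    print(title, link)
```

Whether de-duplication is wanted depends on the use case: the page lists a contest once per title won, so dropping repeats loses that count.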
Upvotes: 2
Reputation: 71
Try using the select function of the soup object, which accepts CSS selectors.
In your case, you could use something like:
a_tags = soup.select("td[class='no-border-links'] > a")
Now you can iterate over this list and read each title from the tag's text attribute.
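As a rough sketch of that iteration, the snippet below runs the selector against a small, made-up HTML fragment (SAMPLE_HTML, the hrefs, and the extra class name "hauptlink" are assumptions for illustration, not taken from the real page). It also shows one caveat: the attribute form td[class='no-border-links'] only matches cells whose class attribute is exactly that string, while td.no-border-links also matches cells that carry additional classes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real table has more rows and different hrefs.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="no-border-links"><a title="DFB-Pokal" href="/contest/dfb-pokal">DFB-Pokal</a></td>
    <td class="no-border-links hauptlink"><a title="DFL-Supercup" href="/contest/dfl-supercup">DFL-Supercup</a></td>
    <td class="zentriert"><a title="Other" href="/other">Other</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact attribute match: the class attribute must be precisely "no-border-links".
exact = soup.select("td[class='no-border-links'] > a")
# Class selector: matches any td that has "no-border-links" among its classes.
by_class = soup.select("td.no-border-links > a")

# Titles can be read from the link text or from the title attribute.
exact_titles = [a.get_text(strip=True) for a in exact]
class_titles = [a.get("title") for a in by_class]

print(exact_titles)
print(class_titles)
```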
Upvotes: 0