Reputation: 185
I would like to find all the href and title values (i.e. the club names with their corresponding links) inside a div. I have the following code. How can I extract each item here?
My Code:
import requests
import xlrd
import xlsxwriter
from bs4 import BeautifulSoup
list0 = ['Verein']
list1 = ['Verein_Link']
list2 = ['Zugehörige_Vereine']
list3 = ['Zugehörige_Vereine_Link']
workbook = xlrd.open_workbook('url_allclubs.xlsx')
worksheet = workbook.sheet_by_name('Sheet1')
rows = worksheet.nrows
headers = {'Host': 'www.transfermarkt.de',
           'Referer': 'https://www.transfermarkt.de/jumplist/startseite/verein/27',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
for i in range(rows):
    url = worksheet.cell_value(i, 0)  # read the cell as a plain string
    pageTree = requests.get(url, headers=headers)
    soup = BeautifulSoup(pageTree.content, 'lxml')
    club = soup.find_all('h1')
    allclubs = soup.find_all(id='alleTemsVerein')
    list0.append(str(club[0].text))
    list1.append('x' + url)
    list2.append(str(allclubs[0]))  # this is not working yet
    list3.append(str(allclubs[0]))  # this is not working yet
book = xlsxwriter.Workbook('allclubs.xlsx')
sheet1 = book.add_worksheet()
for i, e in enumerate(list0):
    sheet1.write(i, 0, e)
for i, e in enumerate(list1):
    sheet1.write(i, 1, e)
for i, e in enumerate(list2):
    sheet1.write(i, 2, e)
for i, e in enumerate(list3):
    sheet1.write(i, 3, e)
book.close()
This is what I get from my allclubs soup: [screenshot]
Here you can see where to find the list of all clubs: [screenshot]
How can I drill down further into the allclubs soup so that I can extract each club name and link in a loop?
Upvotes: 0
Views: 2070
Reputation: 8245
You can find all the links in that allclubs div and then get their .text for the title and their 'href' attribute for the link.
import requests
from bs4 import BeautifulSoup

headers = {'Host': 'www.transfermarkt.de',
           'Referer': 'https://www.transfermarkt.de/jumplist/startseite/verein/27',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
url = 'https://www.transfermarkt.de/jumplist/startseite/verein/27'
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'lxml')

allclubs = soup.find(id='alleTemsVerein')  # only one div has this id
team_links = allclubs.find_all('a')
for link in team_links:
    print(link.text, link['href'])
Output
FC Bayern München /fc-bayern-munchen/startseite/verein/27
FC Bayern München II /fc-bayern-munchen-ii/startseite/verein/28
FC Bayern München U19 /fc-bayern-munchen-u19/startseite/verein/1462
FC Bayern München U17 /fc-bayern-munchen-u17/startseite/verein/21058
FC Bayern München U16 /fc-bayern-munchen-u16/startseite/verein/23112
FC Bayern München UEFA U19 /fc-bayern-munchen-uefa-u19/startseite/verein/41585
FC Bayern München Jugend /fc-bayern-munchen-jugend/startseite/verein/18936
Note that I have used find for allclubs, as there is only one div with that id.
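If you want to feed this straight into the lists from your question, here is a minimal sketch of the body of your for-loop (where soup, list2 and list3 already exist). The hrefs on the page are relative, so joining them with the site root is my assumption for building full links:

from urllib.parse import urljoin  # standard library

base = 'https://www.transfermarkt.de'  # assumed site root for the relative hrefs

allclubs = soup.find(id='alleTemsVerein')
for link in allclubs.find_all('a'):
    list2.append(link.text)                    # club name
    list3.append(urljoin(base, link['href']))  # absolute club link

Keep in mind this appends several clubs per source URL, so list2/list3 will grow faster than list0/list1 and the columns will no longer line up row by row; joining the names into a single cell, e.g. with ', '.join(...), is one way around that.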
Upvotes: 2