viktor

Reputation: 185

BeautifulSoup find all title and href inside of div > span > a

I would like to find all the href and title text (i.e. the club names with their corresponding links) inside a div. I have the following code. How can I extract each item here?

My Code:

import requests
import xlrd
import xlsxwriter
from bs4 import BeautifulSoup

# column headers (Verein = club, Zugehörige Vereine = affiliated clubs)
list0 = ['Verein']
list1 = ['Verein_Link']
list2 = ['Zugehörige_Vereine']
list3 = ['Zugehörige_Vereine_Link']

workbook = xlrd.open_workbook('url_allclubs.xlsx')
worksheet = workbook.sheet_by_name('Sheet1')
rows = worksheet.nrows

for i in range(0, rows):
    # cell_value returns the plain string, so there is no cell repr to strip
    url = str(worksheet.cell_value(i, 0))

    headers = {'Host': 'www.transfermarkt.de',
               'Referer': 'https://www.transfermarkt.de/jumplist/startseite/verein/27',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

    pageTree = requests.get(url, headers=headers)
    soup = BeautifulSoup(pageTree.content, 'lxml')
    club = soup.find_all('h1')
    allclubs = soup.find_all(id='alleTemsVerein')

    list0.append(club[0].text)
    list1.append('x' + url)
    list2.append(str(allclubs[0]))   # this is not working yet
    list3.append(str(allclubs[0]))   # this is not working yet

book = xlsxwriter.Workbook('allclubs.xlsx')
sheet1 = book.add_worksheet()

for i, e in enumerate(list0):
    sheet1.write(i, 0, e)
for i, e in enumerate(list1):
    sheet1.write(i, 1, e)
for i, e in enumerate(list2):
    sheet1.write(i, 2, e)
for i, e in enumerate(list3):
    sheet1.write(i, 3, e)

book.close()

This is what I get from my allclubs soup: [screenshot of the soup output]

Here you can see where to find the list of all clubs: [screenshot of the club list on the website]

How can I drill down further into the allclubs soup so that I can extract each club name and link in a loop?

Upvotes: 0

Views: 2070

Answers (1)

Bitto

Reputation: 8245

You can find all the links in that allclubs div, then take each link's .text for the title and its 'href' attribute for the link.

import requests
from bs4 import BeautifulSoup

headers = {'Host': 'www.transfermarkt.de',
           'Referer': 'https://www.transfermarkt.de/jumplist/startseite/verein/27',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

url = 'https://www.transfermarkt.de/jumplist/startseite/verein/27'
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'lxml')
club = soup.find_all('h1')
allclubs = soup.find(id='alleTemsVerein')  # find, not find_all: this id occurs once
team_links = allclubs.find_all('a')        # every link inside that div
for link in team_links:
    print(link.text, link['href'])

Output

FC Bayern München /fc-bayern-munchen/startseite/verein/27
FC Bayern München II /fc-bayern-munchen-ii/startseite/verein/28
FC Bayern München U19 /fc-bayern-munchen-u19/startseite/verein/1462
FC Bayern München U17 /fc-bayern-munchen-u17/startseite/verein/21058
FC Bayern München U16 /fc-bayern-munchen-u16/startseite/verein/23112
FC Bayern München UEFA U19 /fc-bayern-munchen-uefa-u19/startseite/verein/41585
FC Bayern München Jugend /fc-bayern-munchen-jugend/startseite/verein/18936
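
If you want to feed these back into the lists from your question, a minimal sketch (assuming the relative hrefs above should be joined with the site root to get absolute links) would be:

base = 'https://www.transfermarkt.de'  # assumption: the hrefs above are site-relative
for link in team_links:
    list2.append(link.text)            # club name, e.g. 'FC Bayern München II'
    list3.append(base + link['href'])  # absolute link to that club's page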

Note that I have used find for allclubs, as there is only one div with that id.
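
A quick sketch of the difference between the two:

tag = soup.find(id='alleTemsVerein')       # a single Tag (or None if nothing matches)
tags = soup.find_all(id='alleTemsVerein')  # always a list, here of length 1
assert tags[0] is tag                      # find returns the first find_all result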

Upvotes: 2
