Reputation: 25
I don't want to annoy you with my very basic questions, but I am stuck and I hope you can help me. I've done tutorials and watched many videos, but I can't figure out what I am doing wrong. I want to scrape data from this table: https://www.youpriboo.com/vorher_102_main_nat.php?action=show&liga=2.BL
This is my code:
import requests
from bs4 import BeautifulSoup
base_URL = 'https://www.youpriboo.com/vorher_102_main_nat.php?action=show&liga='
liga = '2.BL'
URL = base_URL + liga
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
for name in soup.find_all("td", class_="hac"):
    name1 = name.parent.find_all('td')[1] # team1
    name2 = name.parent.find_all('td')[2] # team2
    wahr1 = name.parent.find_all('td')[6] # wahr1
    print(name1.get_text() +' '+ name2.get_text()+' '+ wahr1.get_text())
The problem is that it gives me the data three times, and there are three numbers listed between the games.
The expected result would look like this:
Arminia Bielefeld VfB Stuttgart 34,43
SV Wehen Wiesbaden VfL Osnabrück 34,51
(and so on)
Thanks for your time and work!
I have posted this also here: https://www.reddit.com/r/Python/comments/d9km7y/scraping_data_using_bs4_gives_me_unexpected/
Upvotes: 2
Views: 85
Reputation: 22440
You can scrape the results and write them to a CSV file in a few different ways; the one I prefer is pandas. Use the :has() pseudo-class in the selector to filter out the unwanted rows up front. The following should work:
import requests
import pandas as pd
from bs4 import BeautifulSoup
base_URL = 'https://www.youpriboo.com/vorher_102_main_nat.php?action=show&liga='
liga = '2.BL'
URL = f"{base_URL}{liga}"
page = requests.get(URL, headers={"User-Agent": 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')
rows = []  # one dict per match
for tr in soup.select('.prognose_tab_1 tr:has(.greycell)'):
    name1 = tr.select('.hac')[1].get_text()
    name2 = tr.select('.hac')[2].get_text()
    wahr1 = tr.select('.greycell')[0].get_text()
    rows.append({'Name_One':name1, 'Name_Ano':name2, 'Wahr':wahr1})
    print(f"{name1} {name2} {wahr1}")
# DataFrame.append was removed in pandas 2.0, so build the frame once from the list
df = pd.DataFrame(rows, columns=['Name_One','Name_Ano','Wahr'])
df.to_csv("youpriboo.csv", encoding='utf-8', index=False)
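To see what the `tr:has(.greycell)` selector keeps, here is a minimal sketch that runs against made-up markup instead of the live page. The HTML in `SAMPLE_HTML`, the team names, and the helper `extract_matches` are assumptions for illustration only; the real table may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the assumed shape of the real table:
# match rows carry a .greycell, the filler row in between does not.
SAMPLE_HTML = """
<table class="prognose_tab_1">
  <tr><th>Nr</th><th>Heim</th><th>Gast</th><th>Wahr</th></tr>
  <tr><td class="hac">1</td><td class="hac">Team A</td><td class="hac">Team B</td><td class="greycell">34,43</td></tr>
  <tr><td class="hac">1</td><td class="hac">2</td><td class="hac">3</td></tr>
  <tr><td class="hac">2</td><td class="hac">Team C</td><td class="hac">Team D</td><td class="greycell">34,51</td></tr>
</table>
"""


def extract_matches(html):
    """Return (team1, team2, probability) for every row that contains a .greycell."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    # :has(.greycell) keeps only rows with at least one .greycell descendant
    for tr in soup.select(".prognose_tab_1 tr:has(.greycell)"):
        cells = tr.select(".hac")
        matches.append((
            cells[1].get_text(),
            cells[2].get_text(),
            tr.select_one(".greycell").get_text(),
        ))
    return matches


matches = extract_matches(SAMPLE_HTML)
for team1, team2, wahr in matches:
    print(team1, team2, wahr)
```

The header row and the filler row are skipped because neither contains a `.greycell`, so only the two match rows are printed.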
Upvotes: 1
Reputation: 33384
Try the code below. It should give you the expected output.
import requests
from bs4 import BeautifulSoup
base_URL = 'https://www.youpriboo.com/vorher_102_main_nat.php?action=show&liga='
liga = '2.BL'
URL = base_URL + liga
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.select_one(".prognose_tab_1")
for tr in table.select('tr'):
    if tr.select('.hac') and tr.select('.greycell'):
        name1 = tr.select('.hac')[1]
        name2 = tr.select('.hac')[2]
        wahr1 = tr.select('.greycell')[0]
        print(name1.get_text() +' '+ name2.get_text()+' '+ wahr1.get_text())
Output
Arminia Bielefeld VfB Stuttgart 34,43
SV Wehen Wiesbaden VfL Osnabrück 34,51
Jahn Regensburg Hamburger SV 24,18
Karlsruher SC 1. FC Heidenheim 37,70
VfL Bochum SV Darmstadt 98 55,22
Erzgebirge Aue Dynamo Dresden 37,70
FC St. Pauli SV Sandhausen 43,90
SpVgg Greuther Fürth Holstein Kiel 46,23
Hannover 96 1. FC Nürnberg 46,23
Upvotes: 0
Reputation: 1938
The filtering is not correct. Try iterating over the rows instead:
table = soup.find_all("tr")
#print(table)
for row in table:
    data = row.find_all("td", class_="hac")
    if len(data) > 2:  # need indexes 1 and 2, so at least three cells
        print(data[1].get_text(), data[2].get_text())
    data = row.find_all("td", class_="greycell")
    if len(data) > 0:
        print(data[0].get_text())
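The reason the original loop repeats itself is probably that it runs once per `td.hac` cell and then climbs back up to the parent row, so a row with three such cells is handled three times. Here is a minimal sketch of that effect, assuming each match row holds three `hac` cells; the markup in `SAMPLE_HTML` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical single row, assumed to resemble one match row on the page.
SAMPLE_HTML = """
<table>
  <tr><td class="hac">1</td><td class="hac">Team A</td><td class="hac">Team B</td><td class="greycell">34,43</td></tr>
</table>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Cell-based loop (as in the question): one iteration per td.hac,
# and every iteration reads the same parent row.
per_cell = [
    cell.parent.find_all("td")[1].get_text()
    for cell in soup.find_all("td", class_="hac")
]

# Row-based loop: one iteration per tr, so each row is read once.
per_row = []
for row in soup.find_all("tr"):
    hac_cells = row.find_all("td", class_="hac")
    if len(hac_cells) > 2:
        per_row.append(hac_cells[1].get_text())

print(per_cell)  # ['Team A', 'Team A', 'Team A']
print(per_row)   # ['Team A']
```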
Upvotes: 0