Reputation: 7
I am having some issues saving rows to a CSV file after web scraping. I used the same approach on another site and it worked well, but now the CSV file is blank. It seems Python is not writing any rows.
Here is my code. Thanks in advance:
import requests
from bs4 import BeautifulSoup
import csv
import lxml
html_page = requests.get('https://www.scrapethissite.com/pages/forms/?page_num=1').text
soup = BeautifulSoup(html_page, 'lxml')
# get the number of pages (it might change in the future as the data is updated)
pagenum = soup.find('ul', {'class': 'pagination'})
n = pagenum.findAll('li')[-2].find('a')['href'].split('=')[1]
# now we convert the value of the page in a range so that we can loop over it
page = range(1, int(n) + 1)
print(page)
with open('HockeyLeague.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['team_name', 'year', 'wins', 'losses', 'win_perc', 'goal_for', 'goal_against'])
    for p in page:
        html_page = requests.get(f'https://www.scrapethissite.com/pages/forms/?page_num={p}&per_page=25').text
        soup = BeautifulSoup(html_page, 'lxml')
        table = soup.find('table', {'class': 'table'})
        for row in table.findAll('tr', {'class': 'team'}):
            # getting the wanted variables:
            team_name = row.find('td', {'class': 'name'}).text
            year = row.find('td', {'class': 'year'}).text
            wins = row.find('td', {'class': 'wins'}).text
            losses = row.find('td', {'class': 'losses'}).text
            goal_for = row.find('td', {'gf'}).text
            goal_against = row.find('td', {'ga'}).text
            try:
                win_perc = row.find('td', {'pct text-success'}).text
            except:
                win_perc = row.find('td', {'pct text-danger'}).text
            # write the data in the csv file we created at the beginning
            csv_writer.writerow([team_name, year, wins, losses, win_perc, goal_for, goal_against])
Upvotes: 0
Views: 58
Reputation: 25241
Since the script works in general, these are just a few things to keep in mind:
I would recommend opening the file with newline='' on all platforms to disable universal newlines translation, and with encoding='utf-8' to be sure you are working with the "correct" encoding:
with open('HockeyLeague.csv', 'w', newline='', encoding='utf-8') as f:
    ...
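To make the effect of these two arguments visible, here is a minimal sketch that writes a tiny CSV and reads the raw bytes back. The sample rows, the accented team name, and the temporary file path are assumptions made for illustration; they are not part of the original script.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; the accented name is only there to exercise the encoding.
rows = [['team_name', 'year'], ['Montréal Canadiens', '1990']]

# Throwaway location for the demo file (an assumption, not the asker's path).
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# newline='' lets the csv module control line endings itself;
# encoding='utf-8' makes the byte encoding explicit instead of platform-dependent.
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Read the raw bytes back to see exactly what landed on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(raw)
```

Without newline='', the '\r\n' that csv.writer emits can be translated again on Windows into '\r\r\n', which typically shows up as an empty row after every record.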
Call .strip() on your texts, or use .get_text(strip=True), to get clean output and avoid line breaks you do not want:
team_name = row.find('td', {'class': 'name'}).text.strip()
year = row.find('td', {'class': 'year'}).text.strip()
...
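As a rough illustration of why stripping matters for the CSV, the sketch below serializes the same value with and without .strip(). The padded string is a hypothetical stand-in for what .text can return from an indented table cell, and to_csv_line is a helper invented for this example.

```python
import csv
import io

# Hypothetical cell text, padded the way indented HTML often is.
raw_text = '\n        Boston Bruins\n    '
clean_text = raw_text.strip()

def to_csv_line(fields):
    """Serialize one row with csv.writer and return the resulting text."""
    buf = io.StringIO(newline='')
    csv.writer(buf).writerow(fields)
    return buf.getvalue()

dirty_line = to_csv_line([raw_text, '1990'])
clean_line = to_csv_line([clean_text, '1990'])

print(repr(dirty_line))  # quoted field that spans several physical lines
print(repr(clean_line))  # single, tidy line
```

The unstripped value is still valid CSV, but the field gets quoted and spreads over several lines, which makes the file look broken in a text editor.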
In newer code, avoid the old findAll() syntax and use find_all() instead.
For more, take a minute to check the docs.
The following example uses a while loop that checks for the "Next" button and extracts its URL, and uses stripped_strings to extract the texts from each row:
import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.scrapethissite.com/pages/forms/'

with open('HockeyLeague.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['team_name', 'year', 'wins', 'losses', 'win_perc', 'goal_for', 'goal_against'])
    while True:
        html_page = requests.get(url).text
        # name the parser explicitly to avoid the "no parser was explicitly specified" warning
        soup = BeautifulSoup(html_page, 'html.parser')
        for row in soup.find_all('tr', {'class': 'team'}):
            # write the data in the csv file we created at the beginning
            csv_writer.writerow(list(row.stripped_strings)[:-1])
        # follow the "Next" link until a page no longer has one
        next_link = soup.select_one('.pagination a[aria-label="Next"]')
        if next_link:
            url = 'https://www.scrapethissite.com' + next_link.get('href')
        else:
            break
team_name,year,wins,losses,win_perc,goal_for,goal_against
Boston Bruins,1990,44,24,0.55,299,264
Buffalo Sabres,1990,31,30,0.388,292,278
Calgary Flames,1990,46,26,0.575,344,263
Chicago Blackhawks,1990,49,23,0.613,284,211
Detroit Red Wings,1990,34,38,0.425,273,298
Edmonton Oilers,1990,37,37,0.463,272,272
...
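The loop above builds the next URL by concatenating the site root with the link's href. As a hedged alternative, urllib.parse.urljoin resolves the href against the current page URL, so it also copes with relative links. The helper name next_page_url and the sample href values are assumptions for illustration; the actual href format on the site may differ.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.scrapethissite.com/pages/forms/'

def next_page_url(current_url, href):
    """Return the absolute URL for a pagination href, or None if there is no next link."""
    if not href:
        return None
    return urljoin(current_url, href)

# Hypothetical href values in three common shapes.
root_relative = next_page_url(BASE_URL, '/pages/forms/?page_num=2')
query_only = next_page_url(BASE_URL, '?page_num=3')
missing = next_page_url(BASE_URL, None)

print(root_relative)
print(query_only)
print(missing)
```

In the while loop, this would replace the string concatenation: pass the current url and the href of the "Next" link, and break when the result is None.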
Upvotes: 1