Issues saving data to CSV during web scraping

Question

I am having trouble saving rows to a CSV file after web scraping. The same approach worked well on another site, but now the CSV file is blank; it seems Python is not writing any rows.

Here is my code, thanks in advance:

import requests
from bs4 import BeautifulSoup
import csv
import lxml

html_page = requests.get('https://www.scrapethissite.com/pages/forms/?page_num=1').text
soup = BeautifulSoup(html_page, 'lxml')

# get the number of pages (it might change in the future as the data is updated)
pagenum = soup.find('ul', {'class': 'pagination'})
n = pagenum.findAll('li')[-2].find('a')['href'].split('=')[1]

# now we convert the value of the page in a range so that we can loop over it
page = range(1, int(n) + 1)
print(page)

with open('HockeyLeague.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['team_name', 'year', 'wins', 'losses', 'win_perc', 'goal_for', 'goal_against'])

    for p in page:
        html_page = requests.get(f'https://www.scrapethissite.com/pages/forms/?page_num={p}&per_page=25').text
        soup = BeautifulSoup(html_page, 'lxml')

        table = soup.find('table', {'class': 'table'})

        for row in table.findAll('tr', {'class': 'team'}):

            # getting the wanted variables:
            team_name = row.find('td', {'class': 'name'}).text
            year = row.find('td', {'class': 'year'}).text
            wins = row.find('td', {'class': 'wins'}).text
            losses = row.find('td', {'class': 'losses'}).text
            goal_for = row.find('td', {'gf'}).text
            goal_against = row.find('td', {'ga'}).text

            try:
                win_perc = row.find('td', {'pct text-success'}).text
            except:
                win_perc = row.find('td', {'pct text-danger'}).text

            # write the data in the csv file we created at the beginning
            csv_writer.writerow([team_name, year, wins, losses, win_perc, goal_for, goal_against])

HedgeHog · Accepted Answer

The script works in general, so these are just some things you should keep in mind:

I would recommend opening the file with newline='' on all platforms to disable universal newlines translation, and with encoding='utf-8' so the file's encoding is explicit instead of depending on the platform default:
```
with open('HockeyLeague.csv', 'w', newline='', encoding='utf-8') as f:
    ...
```
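
As a minimal sketch of what these two arguments do (the temporary path and the sample rows are made up for illustration, they are not scraped data), the snippet below writes two rows and reads the raw bytes back. With newline='' the csv module's own '\r\n' row terminator is written unchanged, and the non-ASCII character is stored as UTF-8:

```python
import csv
import os
import tempfile

# Hypothetical sample rows, not scraped data.
rows = [['team_name', 'year'], ['Montréal Canadiens', '1990']]

path = os.path.join(tempfile.mkdtemp(), 'sample.csv')

# newline='' lets the csv module control the line endings itself;
# encoding='utf-8' makes the "é" safe regardless of the platform default.
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Inspect the exact bytes that ended up on disk.
with open(path, 'rb') as f:
    raw = f.read()

# Read the file back with the same settings.
with open(path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(raw)
print(read_back)
```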

.strip() your texts or use .get_text(strip=True) to get clean output and avoid line breaks you do not want:

team_name = row.find('td', {'class': 'name'}).text.strip()
year = row.find('td', {'class': 'year'}).text.strip()
...
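
To see why the stripping matters, here is a small sketch that uses only the standard library. The `raw_name` string is a made-up stand-in for what `.text` can return when a cell's markup contains line breaks and indentation, and `to_csv_line` is a hypothetical helper written just for this example:

```python
import csv
import io

# Made-up stand-in for the unstripped text of a table cell.
raw_name = '\n                Boston Bruins\n            '


def to_csv_line(values):
    """Return the CSV text that csv.writer produces for one row."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    return buffer.getvalue()


unstripped = to_csv_line([raw_name, '1990'])
stripped = to_csv_line([raw_name.strip(), '1990'])

# The unstripped field contains newlines, so the writer quotes it and
# the single row is spread over several physical lines in the file.
print(repr(unstripped))
print(repr(stripped))
```

A file full of rows like the first one is still valid CSV, but it can look empty or broken when opened in a text editor or spreadsheet.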

In newer code, avoid the old findAll() syntax and use find_all() instead. For more details, take a minute to check the docs.

Alternative Example

This uses a while loop that checks for the "Next" button and extracts its URL, and uses stripped_strings to extract the texts from each row:

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.scrapethissite.com/pages/forms/'

with open('HockeyLeague.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['team_name', 'year', 'wins', 'losses', 'win_perc', 'goal_for', 'goal_against'])

    while True:
        html_page = requests.get(url).text
        # name the parser explicitly to avoid the "no parser was explicitly specified" warning
        soup = BeautifulSoup(html_page, 'lxml')

        for row in soup.find_all('tr', {'class': 'team'}):
            # write the data in the csv file we created at the beginning 
            csv_writer.writerow(list(row.stripped_strings)[:-1])

        if soup.select_one('.pagination a[aria-label="Next"]'):
            url = 'https://www.scrapethissite.com'+soup.select_one('.pagination a[aria-label="Next"]').get('href')
        else:
            break
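
The loop above builds the next URL by string concatenation, which works as long as the href is root-relative. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which resolves relative and absolute references alike. The href strings below are illustrative assumptions, not values copied from the site:

```python
from urllib.parse import urljoin

base_url = 'https://www.scrapethissite.com/pages/forms/'

# Illustrative href values of the kind a "Next" link might carry.
root_relative = '/pages/forms/?page_num=2'
query_only = '?page_num=3'
absolute = 'https://www.scrapethissite.com/pages/forms/?page_num=4'

# urljoin resolves each href against the page it was found on.
next_urls = [urljoin(base_url, href)
             for href in (root_relative, query_only, absolute)]

for next_url in next_urls:
    print(next_url)
```

In the loop, this would mean passing the current page URL and the extracted href to `urljoin` instead of prepending the host name by hand.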

Output

team_name,year,wins,losses,win_perc,goal_for,goal_against
Boston Bruins,1990,44,24,0.55,299,264
Buffalo Sabres,1990,31,30,0.388,292,278
Calgary Flames,1990,46,26,0.575,344,263
Chicago Blackhawks,1990,49,23,0.613,284,211
Detroit Red Wings,1990,34,38,0.425,273,298
Edmonton Oilers,1990,37,37,0.463,272,272
...
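
One thing to be aware of: stripped_strings skips strings that are empty after stripping, so rows with a different number of non-empty cells produce lists of different lengths and the columns can shift. A possible alternative, sketched here on hand-written sample markup, looks each cell up by class name. The class names follow the question's code, the `ot-losses` cell and the `extract_row` helper are assumptions made for the example, and `html.parser` is used so that nothing beyond BeautifulSoup needs to be installed:

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; class names are modelled on the question's code.
html = """
<table class="table">
  <tr class="team">
    <td class="name">
      Boston Bruins
    </td>
    <td class="year">1990</td>
    <td class="wins">44</td>
    <td class="losses">24</td>
    <td class="ot-losses"></td>
    <td class="pct text-success">0.55</td>
    <td class="gf">299</td>
    <td class="ga">264</td>
  </tr>
</table>
"""

# One CSS class per output column, in the order of the CSV header.
COLUMNS = ['name', 'year', 'wins', 'losses', 'pct', 'gf', 'ga']


def extract_row(row):
    """Return the stripped cell text for each wanted column ('' if missing)."""
    values = []
    for css_class in COLUMNS:
        # class_ matches a cell if css_class is any one of its classes,
        # so 'pct' covers both "pct text-success" and "pct text-danger".
        cell = row.find('td', class_=css_class)
        values.append(cell.get_text(strip=True) if cell else '')
    return values


soup = BeautifulSoup(html, 'html.parser')
rows = [extract_row(tr) for tr in soup.find_all('tr', class_='team')]
print(rows)
```

Each list returned by `extract_row` has the same length as the header, so it can be passed to `csv_writer.writerow()` directly.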
