MLapaj

Reputation: 401

HTML Scraping with Beautiful Soup - Unwanted line breaks

I've been trying to write a script that scrapes data from an HTML page and saves it to a .csv file. However, I've run into three minor problems.

First of all, when saving to .csv I get some unwanted line breaks which mess up the output file.

Secondly, players' names (the data concerns NBA players) appear twice.

from bs4 import BeautifulSoup
import requests
import time


teams = ['atlanta-hawks', 'boston-celtics', 'brooklyn-nets']

seasons = []
a=2018
while (a>2016):
    seasons.append(str(a))
    a-=1
print(seasons)  
for season in seasons:

    for team in teams:
        my_url = ' https://www.spotrac.com/nba/'+team+'/cap/'+ season +'/'

        headers = {"User-Agent" : "Mozilla/5.0"}

        response = requests.get(my_url)
        response.content

        soup = BeautifulSoup(response.content, 'html.parser')

        stat_table = soup.find_all('table', class_ = 'datatable')


        my_table = stat_table[0]

        plik = team + season + '.csv'   
        with open (plik, 'w') as r:
            for row in my_table.find_all('tr'):
                for cell in row.find_all('th'):
                    r.write(cell.text)
                    r.write(";")

            for row in my_table.find_all('tr'):
                for cell in row.find_all('td'): 
                    r.write(cell.text)
                    r.write(";")

Also, some of the numbers that contain a "." are being automatically converted to dates when I open the file in Excel.

Any ideas how I could solve those problems?

Screenshot of output file

Upvotes: 2

Views: 436

Answers (2)

doriclazar

Reputation: 97

Richard provided a complete answer that works on Python 3.6+. It calls file.write() for every cell, though, which isn't necessary, so here's an alternative using str.format() that also works on Python versions before 3.6 and writes once per row:

from bs4 import BeautifulSoup
import requests
import time

teams = ['atlanta-hawks', 'boston-celtics', 'brooklyn-nets']
seasons = [2018, 2017]

for season in seasons:
    for team in teams:
        my_url = 'https://www.spotrac.com/nba/{}/cap/{}/'.format(team, season)
        headers = {"User-Agent": "Mozilla/5.0"}

        response = requests.get(my_url, headers=headers)

        soup = BeautifulSoup(response.content, 'html.parser')
        stat_table = soup.find_all('table', class_ = 'datatable')
        my_table = stat_table[0]

        csv_file = '{}-{}.csv'.format(team, season)
        with open(csv_file, 'w') as r:
            for row in my_table.find_all('tr'):
                row_string = ''

                for cell in row.find_all('th'):
                    row_string='{}{};'.format(row_string, cell.text.strip())

                for i, cell in enumerate(row.find_all('td')):
                    cell_string = cell.a.text if i==0 else cell.text
                    row_string='{}{};'.format(row_string, cell_string)

                r.write("{}\n".format(row_string))
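Both answers assemble the semicolon-delimited lines by hand; the standard-library csv module can handle the delimiter, quoting, and row terminators for you, and opening the file with newline='' avoids stray blank lines between rows on Windows. A minimal sketch, assuming the cell texts have already been collected into lists (the player row below is made up for illustration):

```python
import csv

# Hypothetical rows standing in for the cells scraped from one table.
rows = [
    ['Player', 'Cap Hit'],
    ['Trae Young', '1.31'],
]

# newline='' lets csv.writer control line endings itself, which prevents
# the doubled line breaks that otherwise appear on Windows.
with open('demo.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    for row in rows:
        writer.writerow(row)
```

csv.writer will also quote any cell that happens to contain the delimiter, which manual string concatenation does not.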

Upvotes: 2

Richard

Reputation: 2355

I made a few changes to your script. To build the URLs, I'm using string interpolation (f-strings) instead of concatenation. To get rid of the extra whitespace and line breaks, I'm using the strip() method defined on strings. As for the duplicated names, I select the <a> tag inside the first cell and call .text on that element rather than on the whole cell.

# pip install beautifulsoup4
# pip install requests

from bs4 import BeautifulSoup
import requests
import time

teams = ['atlanta-hawks', 'boston-celtics', 'brooklyn-nets']
seasons = [2018, 2017]

for season in seasons:
    for team in teams:
        my_url = f'https://www.spotrac.com/nba/{team}/cap/{season}/'
        headers = {"User-Agent": "Mozilla/5.0"}

        response = requests.get(my_url, headers=headers)

        soup = BeautifulSoup(response.content, 'html.parser')
        stat_table = soup.find_all('table', class_ = 'datatable')
        my_table = stat_table[0]

        csv_file = f'{team}-{season}.csv'
        with open(csv_file, 'w') as r:
            for row in my_table.find_all('tr'):
                for cell in row.find_all('th'):
                    r.write(cell.text.strip())
                    r.write(";")

                for i, cell in enumerate(row.find_all('td')):
                    if i == 0:
                        r.write(cell.a.text.strip())
                    else:
                        r.write(cell.text.strip())
                    r.write(";")
                r.write("\n")

When it comes to Excel converting numbers like 1.31 to dates, that's Excel trying to be smart and failing. If you import the CSV through Excel's import dialog rather than double-clicking the file, you can choose which data type to use for each column. Check out this guide.
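If the files need to open cleanly with a plain double-click (skipping the import dialog), one commonly used workaround is to write such values as Excel text formulas, which Excel displays literally instead of reparsing. The excel_text helper below is a hypothetical addition, not part of either answer:

```python
def excel_text(value):
    # Wrapping a value as ="..." makes Excel treat it as literal text,
    # so a cell like 1.31 is not reinterpreted as a date.
    return '="{}"'.format(value)

print(excel_text('1.31'))  # prints ="1.31"
```

Note that this embeds Excel-specific syntax in the file, so it is only appropriate when the CSV is destined for Excel.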

Upvotes: 2
