Jacob Garwin

Reputation: 65

"UnicodeEncodeError: 'charmap' codec can't encode character" When Writing to csv Using a Webscraper

I've written a web scraper that pulls NBA box score data from basketball-reference. On one particular box score page (the URL assigned to website in the code below), it raises:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0107' in position 11: character maps to <undefined>

The player whose data trips it up has a name containing '\u0107' (ć), although I assume the error is more general and would be produced by any character with a less common accent mark.

The minimal reproducible code:

def get_boxscore_basic_table(tag): #used to only get specific tables
    tag_id = tag.get("id")
    tag_class = tag.get("class")
    return (tag_id and tag_class) and ("basic" in tag_id and "section_wrapper" in tag_class and "toggleable" not in tag_class)

import requests
from bs4 import BeautifulSoup
import lxml
import csv
import re

website = 'https://www.basketball-reference.com/boxscores/202003110MIA.html'

r = requests.get(website).text
soup = BeautifulSoup(r, 'lxml')

tables = soup.find_all(get_boxscore_basic_table)

in_file = open('boxscore.csv', 'w', newline='')
csv_writer = csv.writer(in_file)
column_names = ['Player','Name','MP','FG','FGA','FG%','3P','3PA','3P%','FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PTS','+/-']
csv_writer.writerow(column_names)

for table in tables:    
    rows = table.select('tbody tr')

    for row in rows:
        building_player = [] #temporary container to hold player and stats
        player_name = row.th.text 
        if 'Reserves' not in player_name: 
            building_player.append(player_name)

        stats = row.select('td.right')

        for stat in stats:
            building_player.append(stat.text)

        csv_writer.writerow(building_player) #writing to csv

in_file.close()

What is the best way around this?

I've seen suggestions online about changing the encoding, specifically calling .encode('utf-8') on the string before writing it to the csv. That does stop the error from being thrown, but it has problems of its own. For instance, calling player_name.encode('utf-8') before writing turns the name 'Willy Hernangómez' into b'Willy Hernang\xc3\xb3mez' in my csv... not exactly a step in the right direction.
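To show the .encode() problem in isolation, here is a minimal sketch that reproduces it. It writes to an in-memory buffer (an assumption made for the example, not my real file) so it runs without touching disk or the network:

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for the real csv file
player_name = "Willy Hernang\u00f3mez"

# csv.writer stringifies any non-string value with str(), so a bytes
# object is written as its repr, including the b'...' wrapper.
csv.writer(buffer).writerow([player_name.encode("utf-8")])

written = buffer.getvalue()
print(written)
```

So the encoded bytes never reach the file as bytes; their printable representation does.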

Any help with this and an explanation as to what is happening would be much appreciated!

Upvotes: 1

Views: 1927

Answers (1)

Epsi95

Reputation: 9047

Open the file with an explicit encoding. Use

in_file = open('boxscore.csv', 'w', newline='',  encoding='utf-8')

instead of

in_file = open('boxscore.csv', 'w', newline='')

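and keep everything else the same. Without an explicit encoding, open() falls back to the platform's default, which on Windows is typically cp1252 (Python reports it as the 'charmap' codec), and that code page has no mapping for '\u0107'. When you open the file in Excel, make sure it is read as UTF-8.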
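A small self-contained sketch of the difference, assuming cp1252 is the default encoding on the machine that raised the error (the question does not say which it was). The sample row and the temporary file path are made up for illustration:

```python
import csv
import os
import tempfile

# Sample row: a name from the question plus the character from the traceback.
row = ["Willy Hernang\u00f3mez", "\u0107"]

# cp1252 has no mapping for U+0107, which is what the 'charmap' error reports.
try:
    "\u0107".encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Hypothetical temporary location so the example does not overwrite anything.
path = os.path.join(tempfile.mkdtemp(), "boxscore.csv")

# Write with an explicit UTF-8 encoding; the strings are passed in unchanged.
with open(path, "w", newline="", encoding="utf-8") as out_file:
    csv.writer(out_file).writerow(row)

# Read back with the same encoding to confirm the characters survived.
with open(path, newline="", encoding="utf-8") as in_file:
    round_tripped = next(csv.reader(in_file))

print(cp1252_ok, round_tripped)
```

If Excel still shows garbled characters when the file is opened by double-clicking, encoding='utf-8-sig' (which prepends a byte order mark) is often suggested as a way to help Excel detect UTF-8; treat that as something to try rather than a guarantee.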

Upvotes: 3
