Reputation: 51
I'm currently struggling with the following .csv output, where there are various random characters within the players' names and values where there shouldn't be.
(I've given a picture of the output below.)
I'm wondering where I'm going wrong in the code, as I'm struggling to eliminate the random characters.
I'm trying to remove characters such as Â, Ã, ©, ‰ and so on. Any suggestions?
Python Code
#importing
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
#calling websites
page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#calling players names
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
#Let's look at the first name in the Players list.
Players[0].text
#calling value of players
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})
#Let's look at the first value in the Values list.
Values[0].text
PlayersList = []
ValuesList = []
for i in range(0,25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)
df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})
df.to_csv('2000.csv', index=False)
df.head()
====================================================================
My Excel output
Upvotes: 1
Views: 1891
Reputation: 12624
...
utf8_bom = '\ufeff'  # the BOM code point; the UTF-8 codec writes it as the bytes EF BB BF
with open('2000.csv', 'w', encoding='utf-8') as csv_file:
    csv_file.write(utf8_bom)
    df.to_csv(csv_file, index=False)
Explanation: The BOM is the byte order mark (q.v.). If Excel finds it at the beginning of the CSV file, it uses it to determine the encoding, which in this case is UTF-8 (specified explicitly when opening the file, rather than relying on Python 3's locale-dependent default). In a Python 3 str the BOM is the single code point '\ufeff'; the UTF-8 codec writes it to disk as the three bytes EF BB BF.
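As a quick sanity check (a minimal sketch, not part of the original answer, assuming '2000.csv' was written as above), you can confirm that the file now starts with the three BOM bytes:
# Read the first three raw bytes and compare them with the UTF-8 BOM.
with open('2000.csv', 'rb') as f:
    print(f.read(3) == b'\xef\xbb\xbf')  # True if the BOM is present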
EDIT
As Mark Tolonen pointed out, the compact version of the above is the following code:
df.to_csv('2000.csv', encoding='utf-8-sig', index=False)
The -sig in the name of the encoding stands for “signature”, i.e. the BOM at the beginning, which is used by Microsoft software to detect the encoding. See also the Encodings and Unicode section of the codecs manual.
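For illustration (a quick interactive check, not from the original answer), utf-8-sig simply prepends the BOM when encoding and strips it again when decoding:
>>> 'Zidane'.encode('utf-8-sig')
b'\xef\xbb\xbfZidane'
>>> b'\xef\xbb\xbfZidane'.decode('utf-8-sig')
'Zidane'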
Upvotes: 4
Reputation: 177665
Your system seems to be writing the file encoded as UTF-8. Excel expects UTF-8 files to have a BOM signature; otherwise it assumes a text file is encoded in a locale-specific ANSI encoding. This is for backward compatibility, since Windows predates UTF-8.
Python has an encoding that writes the UTF-8 BOM signature, utf-8-sig, so simply use:
df.to_csv('2000.csv', encoding='utf-8-sig', index=False)
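As a side note (a minimal sketch with made-up sample values, assuming Excel falls back to the Windows-1252 ANSI code page), this also explains where characters like Â, Ã and © come from: the UTF-8 bytes are being reinterpreted one byte at a time.
>>> 'Zinédine Zidane'.encode('utf-8').decode('cp1252')
'ZinÃ©dine Zidane'
>>> '£46.80m'.encode('utf-8').decode('cp1252')
'Â£46.80m'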
Upvotes: 2
Reputation: 51
UPDATE:
I've fixed this situation by following the answer in the link below:
https://stackoverflow.com/a/6488070/10675615
- Save the exported file as a csv in the cmd prompt
- Open Excel
- Import the data using Data --> Import External Data / Get Text/CSV --> Import Data
- Select the file type of "csv" and browse to your file
- In the import wizard change the File_Origin to "65001 UTF" (or choose the correct language character identifier)
- Change the Delimiter to comma
- Select where to import to and click Finish
This way the special characters should show correctly.
Upvotes: 0