Kevin Carmody
Kevin Carmody

Reputation: 51

Excel CSV output appearing wrong after calling Python File

Currently struggling with the following output in .csv where their is various random character within the is the players names and values where there shouldn't be

(I've given a picture below of the output)

I'm wondering where I'm going wrong in the code where I'm struggling to eliminate the random characters

I'm trying to remove the characters below such as Â, Ã, ©, ‰ and so on. Any suggestions?

Python Code

#importing

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 
       'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like 
Gecko) Chrome/47.0.2526.106 Safari/537.36'}

#calling websites
page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

#calling players names
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
#Let's look at the first name in the Players list.
Players[0].text

#calling value of players
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})
#Let's look at the first name in the Values list.
Values[0].text

PlayersList = []
ValuesList = []

for i in range(0,25):
   PlayersList.append(Players[i].text)
   ValuesList.append(Values[i].text)

df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})

df.to_csv('2000.csv', index=False)

df.head()

====================================================================

My Excel output

The Output

Upvotes: 1

Views: 1891

Answers (3)

Walter Tross
Walter Tross

Reputation: 12624

...
utf8_bom = '\xEF\xBB\xBF'
with open('2000.csv', 'w') as csv_file:
    csv_file.write(utf8_bom)
    df.to_csv(csv_file, index=False, mode='a')

Explanation: The BOM is the byte order mark (q.v.). If Excel finds it at the beginning of the CSV file, it uses it to determine the encoding, which in your case is UTF-8 (the default encoding – correctly – for Python 3).


EDIT

As Mark Tolonen pointed out, the compact version of the above is the following code:

df.to_csv('2000.csv', encoding='utf-8-sig', index=False)

The -sig in the name of the encoding stands for “signature”, i.e., the BOM at the beginning which is used by Microsoft software to detect the encoding. See also the Encodings and Unicode section of the codecs manual.

Upvotes: 4

Mark Tolonen
Mark Tolonen

Reputation: 177665

You system seems to be writing the file encoded as UTF-8. Excel expects UTF-8 files to have a BOM signature, else it assumes a text file is encoded in a locale-specific ANSI encoding. This is for backward compatibility due to Windows existing before UTF-8 did.

Python has an encoding that writes the UTF-8 BOM signature, utf-8-sig, so simply use:

df.to_csv('2000.csv', encoding='utf-8-sig', index=False)

Upvotes: 2

Kevin Carmody
Kevin Carmody

Reputation: 51

UPDATE:

I've fixed this situation from the following answer in the link below..

https://stackoverflow.com/a/6488070/10675615

  1. Save the exported file as a csv in the cmd prompt
  2. Open Excel
  3. Import the data using Data-->Import External Data/ Get Text/CSV --> Import Data
  4. Select the file type of "csv" and browse to your file
  5. In the import wizard change the File_Origin to "65001 UTF" (or choose the correct language character identifier)
  6. Change the Delimiter to comma
  7. Select where to import to and Finish This way the special characters should show correctly.

**

Upvotes: 0

Related Questions