MITHU

Reputation: 154

Can't get rid of illegible contents while writing to a csv file

I've written a Python script that uses POST requests to scrape JSON content from a webpage. When I run the script, the results print to the console as expected. However, I run into a problem when I try to write the same data to a CSV file. When I open the file like this:

with open ("outputContent.csv","w",newline="") as f:

I encounter the following error:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\all_reviews_grabber.py", line 27, in <module>
    writer.writerow([nom,ville,region])
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 16: character maps to <undefined>

When I open the file as follows instead, the script does produce a CSV file populated with data:

with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:

However, the CSV file contains some illegible content, such as:

Beijingshì
Xinjiangwéiwúerzìzhìqu
Shànghaishì
Qingpuqu
Shànghaishì
Xúhuìqu
Putuóqu

This is my script so far:

import csv
import requests
from bs4 import BeautifulSoup

baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"

with requests.Session() as s:
    req = s.get(baseUrl)
    sauce = BeautifulSoup(req.text,"lxml")
    token = sauce.select_one("input[name='_token']")['value']

    payload = {
        'data': 'country=0&type=0&input_search=',
        '_token': token
        }

    res = s.post(postUrl,data=payload)
    with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(['nom','ville','region'])
        for item in res.json():
            nom = item['prospect_nom']
            ville = item['prospect_ville']
            region = item['prospect_region']
            print(nom,ville,region)
            writer.writerow([nom,ville,region])

How can I write the content to a CSV file correctly?

Upvotes: 0

Views: 176

Answers (3)

snakecharmerb

Reputation: 55699

The code works correctly, as long as the print statement is removed*.

The corrupted data you are seeing appears because the file's bytes are being decoded as cp1252, rather than UTF-8, when you view it.

>>> s = 'Xinjiangwéiwúerzìzhìqu'
>>> encoded = s.encode('utf-8')
>>> encoded.decode('cp1252')
'XinjiangwÃ©iwÃºerzÃ¬zhÃ¬qu'
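
The same mechanism can be reproduced end to end in a short script. This is a minimal sketch using the sample string from the question; the variable names are only illustrative. It also shows that the garbling is reversible here, because every byte involved happens to map to a cp1252 character.

```python
# Sample value taken from the question's output.
original = 'Xinjiangwéiwúerzìzhìqu'

# Encoding as UTF-8 and then decoding as cp1252 turns each accented
# character (two bytes in UTF-8) into two unrelated characters.
garbled = original.encode('utf-8').decode('cp1252')
print(garbled)

# Reversing the wrong decode recovers the text, which shows that the
# bytes in the file were correct all along.
restored = garbled.encode('cp1252').decode('utf-8')
print(restored == original)
```

In other words, the file itself is fine; only the viewer's choice of encoding is wrong.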

If you are viewing the data by opening the csv file in Python, ensure that you specify UTF-8 encoding** when you open it:

open('outputContent.csv', 'r', encoding='utf-8'...
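
A fuller sketch of reading the file back might look like the following. The temporary path and the sample row are placeholders created by the example itself, not the scraped data.

```python
import csv
import os
import tempfile

# Hypothetical sample row standing in for the scraped data.
rows = [['nom', 'ville', 'region'], ['Exemple', 'Shànghaishì', 'Xúhuìqu']]

path = os.path.join(tempfile.mkdtemp(), 'outputContent.csv')

# Write with an explicit encoding, as in the question.
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Read with the same explicit encoding so the bytes decode correctly.
with open(path, 'r', newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(read_back == rows)
```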

If you are opening the file with an application such as Excel, ensure that you specify that the encoding is UTF-8 when opening it.

If you don't specify an encoding, the system default (cp1252 on your machine) will be used to decode the data in the file, and you will see garbage data.


* print will automatically use the default encoding, so you'll get an exception if it tries to encode characters which can't be encoded as cp1252.
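
To illustrate this footnote, the failure can be reproduced without a console by encoding the character from the traceback directly. This sketch assumes the output stream uses cp1252; the sample text and the `safe_text` helper are hypothetical names introduced only for the example.

```python
text = 'souf\ufb02e'  # hypothetical value containing the ligature U+FB02

# cp1252 has no mapping for U+FB02, so a strict encode fails, which is
# what happens when the text is sent to a cp1252 stream.
try:
    text.encode('cp1252')
    failed = False
except UnicodeEncodeError:
    failed = True


def safe_text(value, encoding='cp1252'):
    """Replace characters the encoding cannot represent with escapes."""
    return value.encode(encoding, errors='backslashreplace').decode(encoding)


print(failed)
print(safe_text(text))
```

Passing values through something like `safe_text` before printing is one way to keep the debug output without risking the exception.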

** It may also be worth trying the 'utf-8-sig' encoding, a variant of UTF-8 used mainly by Microsoft tools that writes a byte-order mark or BOM (b'\xef\xbb\xbf') at the start of the encoded data, but is otherwise identical to UTF-8.
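
As a rough sketch of this second footnote, the following writes a small file with 'utf-8-sig' and inspects the raw bytes. The temporary path and the data row are placeholders. Excel commonly treats the BOM as a hint that the file is UTF-8, though that behaviour may vary by version.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'outputContent.csv')

# 'utf-8-sig' writes the BOM once, at the start of the file.
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['nom', 'ville', 'region'])
    writer.writerow(['Exemple', 'Beijingshì', 'Putuóqu'])  # placeholder row

with open(path, 'rb') as f:
    raw = f.read()

has_bom = raw.startswith(b'\xef\xbb\xbf')

# Decoding with 'utf-8-sig' strips the BOM again; plain 'utf-8' would
# leave it as a '\ufeff' character in front of the first header.
text = raw.decode('utf-8-sig')
print(has_bom, text.startswith('nom'))
```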

Upvotes: 0

Niharika Bitra

Reputation: 477

Take a look at this - http://www.pgbovine.net/unicode-python-errors.htm

  1. Check your default encoding in your interpreter:

    import sys

    sys.stdout.encoding

  2. An old version of Python can also cause this error.
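
The check in step 1 can be extended into a small script. This is only a sketch; it assumes that `open()` without an `encoding` argument falls back to the locale's preferred encoding, which is the documented behaviour for the Python 3 versions I am aware of.

```python
import locale
import sys

# Encoding used by print() for this interpreter's standard output.
stdout_encoding = sys.stdout.encoding

# Encoding that open() is assumed to fall back to when no encoding
# argument is given.
file_default = locale.getpreferredencoding(False)

print('stdout:', stdout_encoding)
print('open() default:', file_default)
```

If either value is cp1252, text containing characters outside that code page needs an explicit `encoding` when it is written or read.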

Upvotes: 1

chitown88

Reputation: 28595

Would using pandas to parse and then write alleviate the issue?

import pandas as pd
import requests
from bs4 import BeautifulSoup

baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"

with requests.Session() as s:
    req = s.get(baseUrl)
    sauce = BeautifulSoup(req.text,"lxml")
    token = sauce.select_one("input[name='_token']")['value']

    payload = {
        'data': 'country=0&type=0&input_search=',
        '_token': token
        }

    res = s.post(postUrl,data=payload)
    jsonObj = res.json()

    # Collect the rows in a list first; DataFrame.append was removed in pandas 2.0
    rows = []
    for item in jsonObj:
        nom = item['prospect_nom']
        ville = item['prospect_ville']
        region = item['prospect_region']
        rows.append([nom,ville,region])

# Build the DataFrame once and write it without the index column
results = pd.DataFrame(rows, columns = ['nom','ville','region'])
results.to_csv("outputContent.csv", index=False)

Upvotes: 0
