Reputation: 154
I've written a script in python using post requests to scrape the json content from a webpage. When I run my script, I get the result in the console as expected. However, I encounter an issue, when I try to write the same in a csv file.
When I try like:
with open ("outputContent.csv","w",newline="") as f:
I encounter the following error:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\all_reviews_grabber.py", line 27, in <module>
writer.writerow([nom,ville,region])
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 16: character maps to <undefined>
When I try like the following, the script does produce a data ridden csv file:
with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:
But, the csv file contains some illegible contents, as in:
Beijingshì
Xinjiangwéiwúerzìzhìqu
Shà nghaishì
Qingpuqu
Shà nghaishì
Xúhuìqu
Putuóqu
This is my script so far:
import csv
import requests
from bs4 import BeautifulSoup
baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"
with requests.Session() as s:
req = s.get(baseUrl)
sauce = BeautifulSoup(req.text,"lxml")
token = sauce.select_one("input[name='_token']")['value']
payload = {
'data': 'country=0&type=0&input_search=',
'_token': token
}
res = s.post(postUrl,data=payload)
with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(['nom','ville','region'])
for item in res.json():
nom = item['prospect_nom']
ville = item['prospect_ville']
region = item['prospect_region']
print(nom,ville,region)
writer.writerow([nom,ville,region])
How can I write the content in the right way in a csv file?
Upvotes: 0
Views: 176
Reputation: 55699
The code works correctly, as long as the print
statement is removed*.
The corrupted data that you are seeing is because you are decoding the file data from cp1252, rather than UTF-8 when you view it.
>>> s = 'Xinjiangwéiwúerzìzhìqu'
>>> encoded = s.encode('utf-8')
>>> encoded.decode('cp1252')
'Xinjiangwéiwúerzìzhìqu'
If you are viewing the data by opening the csv file in Python, ensure that you specify UTF-8 encoding** when you open it:
open('outputContent.csv', 'r', encoding='utf-8'...
If you are opening the file with an application such as Excel, ensure that you specify that the encoding is UTF-8 when opening it.
If you don't specify an encoding the default cp1252 encoding will be used to decode the data in the file, and you will see garbage data.
* print
will automatically use the default encoding, so you'll get an exception if it tries to encode characters which can't be encoded as cp1252.
** It may also be worth trying the 'utf-8-sig' encoding, which is a Microsoft-specific version of UTF-8 that inserts a byte-order-mark or BOM (b'\xef\xbb\xbf'
) at the beginning of encoded strings, but is otherwise identical to UTF-8.
Upvotes: 0
Reputation: 477
Take a look at this - http://www.pgbovine.net/unicode-python-errors.htm
Check your default encoding in your interpreter:
import sys
sys.stdout.encoding
An old version of Python can also cause this error.
Upvotes: 1
Reputation: 28595
Would using pandas to parse and then write alleviate the issue?
import pandas as pd
import requests
from bs4 import BeautifulSoup
baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"
with requests.Session() as s:
req = s.get(baseUrl)
sauce = BeautifulSoup(req.text,"lxml")
token = sauce.select_one("input[name='_token']")['value']
payload = {
'data': 'country=0&type=0&input_search=',
'_token': token
}
res = s.post(postUrl,data=payload)
jsonObj = res.json()
results = pd.DataFrame()
for item in jsonObj:
nom = item['prospect_nom']
ville = item['prospect_ville']
region = item['prospect_region']
#print(id_,nom,ville,region)
temp_df = pd.DataFrame([[nom,ville,region]], columns = ['nom','ville','region'])
results = results.append(temp_df)
results = results.reset_index(drop=True)
results.to_csv("outputContent.csv", idex=False)
Upvotes: 0