Reputation: 102
I am using BeautifulSoup to scrape data from a website. The original text has the form "The 'Hello, World' event", but when I extract it using html.parser and write it to a CSV file, it becomes "The ‘Hello, World’ event". I want it to be written to the CSV in its original form. Below is my code:
import csv
import urllib.request
from bs4 import BeautifulSoup

# url and get_data() are defined elsewhere in my script
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
data = get_data(soup)
with open('The-Hindu-Dataset.csv', 'a', newline='', encoding="UTF-8") as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerows(data)
Upvotes: 0
Views: 1763
Reputation: 177800
It's written correctly; your viewer is using the wrong encoding. Set that program to UTF-8, or open the file with encoding='utf-8-sig' instead, especially if you are on Windows. That writes a signature (a byte order mark) at the start of the file, which programs like Notepad and Excel detect and use to decode the file as UTF-8.
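To see why a viewer with the wrong encoding shows odd characters, here is a minimal sketch. It assumes the viewer falls back to Windows-1252 (Python codec name `cp1252`) and that the text contains curly quotes; the variable names are only for illustration.

```python
# Curly quotes (U+2018 and U+2019) around "Hello, World".
text = "The \u2018Hello, World\u2019 event"

# These are the bytes that open(..., encoding="utf-8") writes to the file.
raw = text.encode("utf-8")

# A viewer that assumes Windows-1252 decodes the same bytes differently:
# each 3-byte UTF-8 sequence turns into three unrelated characters.
misread = raw.decode("cp1252")
print(misread)

# The bytes themselves are fine: reversing the wrong decode recovers the text.
roundtrip = misread.encode("cp1252").decode("utf-8")
print(roundtrip == text)
```

The file content never changes in this sketch; only the interpretation of the bytes does, which is why changing the viewer's encoding (or adding the signature) is enough.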
Example:
#coding:utf8
import csv

# Plain UTF-8: no signature at the start of the file.
with open('test1.csv','w',encoding='utf8',newline='') as f:
    w = csv.writer(f)
    w.writerow(['The ‘Hello, World’ event','你好,世界!'])
    w.writerow(['The ‘Hello, World’ event','你好,世界!'])

# UTF-8 with signature: same content, preceded by a byte order mark.
with open('test2.csv','w',encoding='utf-8-sig',newline='') as f:
    w = csv.writer(f)
    w.writerow(['The ‘Hello, World’ event','你好,世界!'])
    w.writerow(['The ‘Hello, World’ event','你好,世界!'])
test1.csv: [screenshot not preserved]
test2.csv: [screenshot not preserved]
The two files are encoded identically except for the signature, which hints that the content is UTF-8. Without it, Windows programs assume a localized default encoding (typically Windows-1252). Even my hex compare tool assumed Windows-1252.
A better editor (such as Notepad++, or newer versions of Notepad) will display both files correctly.
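If no hex tool is at hand, the difference between the two encodings can be checked from Python. This is a rough sketch: the helper `has_utf8_bom` and the file names `plain.csv` and `signed.csv` are made up for illustration, and the files are written to a temporary directory.

```python
import codecs
import os
import tempfile

def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 signature (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

# Write the same row twice: once as plain UTF-8, once with the signature.
tmpdir = tempfile.mkdtemp()
plain = os.path.join(tmpdir, "plain.csv")
signed = os.path.join(tmpdir, "signed.csv")
row = "The \u2018Hello, World\u2019 event\r\n"

with open(plain, "w", encoding="utf-8", newline="") as f:
    f.write(row)
with open(signed, "w", encoding="utf-8-sig", newline="") as f:
    f.write(row)

# Read the raw bytes back to compare the two files.
with open(plain, "rb") as f:
    plain_bytes = f.read()
with open(signed, "rb") as f:
    signed_bytes = f.read()

print(has_utf8_bom(plain), has_utf8_bom(signed))
print(signed_bytes == codecs.BOM_UTF8 + plain_bytes)
```

When reading such a file back in Python, opening it with encoding='utf-8-sig' strips the signature, so it does not end up as a stray character at the start of the first field.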
Upvotes: 1