batman

Reputation: 102

BeautifulSoup replaces single quote ( ' ) with ( ‘ ) when writing to CSV file

I am using BeautifulSoup to scrape data from a website. The original text is "The 'Hello, World' event", but when I extract it using html.parser and write it to a CSV file, it becomes "The ‘Hello, World’ event". I want it written to the CSV file in its original form. Below is my code:

import csv
import urllib.request
from bs4 import BeautifulSoup

# url and get_data() are defined elsewhere in my script
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
data = get_data(soup)
with open('The-Hindu-Dataset.csv', 'a', newline='', encoding="UTF-8") as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerows(data)

Upvotes: 0

Views: 1763

Answers (1)

Mark Tolonen

Reputation: 177800

It's written correctly. Your viewer is using the wrong encoding. Set that program to UTF-8, or use encoding='utf-8-sig' instead, especially if you are on Windows. That writes a signature (a byte order mark) at the start of the file, which programs like Notepad and Excel detect and use to decode the file as UTF-8.

Example:

# coding: utf8
import csv

# Plain UTF-8: no signature at the start of the file
with open('test1.csv', 'w', encoding='utf8', newline='') as f:
    w = csv.writer(f)
    w.writerow(['The ‘Hello, World’ event', '你好,世界!'])
    w.writerow(['The ‘Hello, World’ event', '你好,世界!'])

# UTF-8 with signature: same content, prefixed with a byte order mark
with open('test2.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.writer(f)
    w.writerow(['The ‘Hello, World’ event', '你好,世界!'])
    w.writerow(['The ‘Hello, World’ event', '你好,世界!'])

test1.csv:

Incorrect Excel display

test2.csv:

Correct Excel display

The files are encoded identically except for the signature, which hints at UTF-8 encoding. Without it, Windows programs assume a localized default encoding (typically Windows-1252). Even my hex compare tool assumed Windows-1252:

Hex compare of files
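
The same comparison can be reproduced without a hex tool. This is a minimal sketch rather than part of the test above: it encodes the string in memory instead of reading test1.csv and test2.csv, and it assumes Windows-1252 (Python codec name 'cp1252') is the encoding the viewer falls back to.

```python
import codecs

text = 'The ‘Hello, World’ event'

plain = text.encode('utf-8')       # the bytes that encoding='utf8' writes
signed = text.encode('utf-8-sig')  # the same bytes, prefixed with the signature

# The only difference is the 3-byte signature at the start.
assert signed == codecs.BOM_UTF8 + plain
print(signed[:3])

# A viewer that assumes Windows-1252 decodes every byte as its own
# character, so each 3-byte curly quote becomes three unrelated characters.
garbled = plain.decode('cp1252')
print(garbled)

# Decoding with the matching codec recovers the original text either way.
assert plain.decode('utf-8') == text
assert signed.decode('utf-8-sig') == text
```

The garbled string is what a misconfigured viewer shows. The bytes on disk are the same correct UTF-8 in both cases.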

A better editor (such as Notepad++, or newer versions of Notepad) will display both files correctly.
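
Applied to the question's code, which opens the file in append mode, the change could look like the sketch below. The helper name append_rows, the temporary file path, and the sample rows are hypothetical and only there to make the example runnable. It relies on CPython writing the signature only when the file is empty at the time it is opened for appending.

```python
import csv
import os
import tempfile


def append_rows(path, rows):
    """Append rows to a CSV file encoded as UTF-8 with a signature."""
    # In append mode the signature is written only if the file is empty,
    # so repeated calls do not insert extra copies in the middle.
    with open(path, 'a', newline='', encoding='utf-8-sig') as csvfile:
        csv.writer(csvfile, delimiter=',').writerows(rows)


# Hypothetical demo file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
append_rows(path, [['The ‘Hello, World’ event', 'first run']])
append_rows(path, [['The ‘Hello, World’ event', 'second run']])

# Inspect the raw bytes to see how many signatures ended up in the file.
with open(path, 'rb') as f:
    raw = f.read()
print(raw.count(b'\xef\xbb\xbf'))
```

Note that a file which already exists without a signature will not gain one by appending, so an existing dataset would have to be rewritten once, or the viewer set to UTF-8.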

Upvotes: 1
