Seraph

Reputation: 444

Saving scraped data with the correct encoding

I am scraping product description information from Books to Scrape, and saving the data to a csv file. Here is the code to initialize the DictWriter:

with open('book_data.csv', 'w', newline='', encoding='utf8') as file:
    dictWriter = csv.DictWriter(file, fieldnames=['Title', 'Price', 'Rating', 'Description', 'UPC'])
    dictWriter.writeheader()

The program parses the HTML of each page and writes the extracted fields as a dict:

dictWriter.writerow(
    {'Title': title, 'Price': price, 'Rating': rating, 'Description': description, 'UPC': upc})

The program runs fine; however, characters with accents, such as (à ê î ô ú), are unrecognizable in the csv file. I outlined a few examples in red in the screenshot below:

(screenshot of the CSV file with the garbled accented characters outlined in red)

I suspect this is an encoding issue. I thought using encoding='utf8' with the DictWriter would resolve this; however, it did not.
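The symptom can be reproduced in a few lines, assuming the viewer falls back to a legacy single-byte code page such as cp1252 ("ANSI") while the bytes on disk are valid UTF-8:

```python
# Sketch of the suspected failure mode: the file's bytes are valid UTF-8,
# but the viewer decodes them as cp1252, producing mojibake.
text = "à ê î ô ú"
utf8_bytes = text.encode("utf-8")      # what Python writes with encoding='utf8'
mangled = utf8_bytes.decode("cp1252")  # what a cp1252 viewer would display
print(mangled)
```

Round-tripping the same bytes through UTF-8 recovers the original text, which suggests the file itself is fine and the display encoding is the problem.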

Questions:

  1. What is going on here, and why is it happening?
  2. How can I correct this and save the data in a readable format?

Upvotes: 0

Views: 133

Answers (1)

Martin Evans

Reputation: 46759

The following demo shows a way to get (most of) the details. The output CSV file is correctly written in utf-8 format.

req.content passes the raw returned bytes to BeautifulSoup, letting it parse the HTML and apply the correct encoding itself. requests also attempts to determine an encoding from the returned HTTP headers (that guess is what req.text uses), but since BeautifulSoup actually parses the HTML, it can usually make a better choice.

req.encoding can be used to display the encoding that requests chose for req.text.
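The difference can be sketched offline with a hand-built Response (a real Response is populated by requests during the HTTP exchange; the private `_content` field is set here only for the demonstration). If the headers claimed ISO-8859-1 but the body bytes are UTF-8, req.text mangles the accents while req.content preserves the bytes for BeautifulSoup:

```python
import requests

# Offline sketch: build a Response by hand to illustrate text vs. content.
resp = requests.models.Response()
resp.status_code = 200
resp._content = "crème brûlée à la carte".encode("utf-8")  # body is UTF-8
resp.encoding = "ISO-8859-1"  # what requests might infer from the headers

print(resp.text)                     # decoded with the wrong encoding: mojibake
print(resp.content.decode("utf-8"))  # the raw bytes, decoded correctly
```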

import requests
from bs4 import BeautifulSoup
import csv

url = "https://books.toscrape.com/"
req_main = requests.get(url)
soup_main = BeautifulSoup(req_main.content, "html.parser")

with open("output.csv", "w", newline="", encoding="utf-8") as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=["Title", "Price", "Description"])
    csv_output.writeheader()
    
    for article in soup_main.find_all("article", class_="product_pod"):
        req_book = requests.get(url + article.a['href'])
        soup_book = BeautifulSoup(req_book.content, "html.parser")
        
        row = {
            "Title" : soup_book.h1.text,
            "Price" : soup_book.find("p", class_="price_color").text,
            "Description" : soup_book.find("div", id="product_description").find_next("p").text,
        }
        
        csv_output.writerow(row)

The output as shown in Notepad++ with utf-8 encoding:

utf-8 screenshot

Notepad++ is able to correctly display the utf-8 encoded characters. If the display encoding is switched to ANSI, you get:

ansi screenshot

You need to ensure your editor can correctly display utf-8 encoded text.
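A related gotcha, not covered above: Excel often assumes a legacy "ANSI" code page for CSV files unless the file starts with a UTF-8 byte-order mark. Writing with encoding='utf-8-sig' is a common workaround; the file name and field values below are illustrative:

```python
import csv

# Write with a BOM so Excel auto-detects UTF-8 (file name is illustrative).
with open("book_data.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "Description"])
    writer.writeheader()
    writer.writerow({"Title": "Café", "Description": "crème brûlée à la carte"})

# Read back with the matching encoding; the accents survive the round trip.
with open("book_data.csv", newline="", encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["Title"])
```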

Upvotes: 2
