Reputation: 444
I am scraping product description information from Books to Scrape, and saving the data to a csv file. Here is the code that initializes the DictWriter:

with open('book_data.csv', 'w', newline='', encoding='utf8') as file:
    dictWriter = csv.DictWriter(file, fieldnames=['Title', 'Price', 'Rating', 'Description', 'UPC'])
    dictWriter.writeheader()
The program parses the HTML of each page, saves the data to a dict, and writes it as a row:

dictWriter.writerow(
    {'Title': title, 'Price': price, 'Rating': rating, 'Description': description, 'UPC': upc})
The program works fine; however, characters with accents, such as à, ê, î, ô, and ú, are unrecognizable in the csv file. I outlined a few examples in red in the screenshot below:

I suspect this is an encoding issue. I thought using encoding='utf8' when opening the file for the DictWriter would resolve this; however, it did not.
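As an aside, the symptom described (each accented character showing up as two garbage characters) is typical of a file that *is* valid UTF-8 but is being decoded by the viewer as a single-byte encoding such as cp1252. A minimal sketch reproducing the mojibake, assuming the viewer decodes as cp1252:

```python
# The file on disk contains correct UTF-8 bytes; the garbling happens
# only when a viewer decodes those bytes as cp1252 (Windows "ANSI").
text = "à ê î ô ú"
utf8_bytes = text.encode("utf-8")      # what the csv module writes with encoding='utf8'

garbled = utf8_bytes.decode("cp1252")  # what a cp1252 viewer displays
print(garbled)                         # each accented letter becomes a 'Ã' plus a second character
```

This suggests the data is being written correctly and the problem is on the reading/display side.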
Upvotes: 0
Views: 133
Reputation: 46759
The following demo shows a way to get (most of) the details. The output CSV file is correctly written in utf-8 format.

req.content is used to pass the returned bytes to BeautifulSoup, which allows it to parse the HTML and apply the correct encoding. requests also attempts to determine the encoding based on the returned headers (used with req.text), but as BeautifulSoup is parsing the HTML itself, it can usually make a better choice. req.encoding can be used to display the encoding requests chose for req.text.
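To illustrate the difference with a made-up HTML snippet (hypothetical, not one of the scraped pages): when the HTTP headers carry no charset, requests has historically fallen back to ISO-8859-1 for text bodies, whereas the raw bytes still contain the page's own charset declaration that an HTML parser can honour:

```python
# A hypothetical HTML document encoded as UTF-8, with its charset declared
# in a <meta> tag rather than in the HTTP headers.
html_bytes = '<meta charset="utf-8"><p>Café crème</p>'.encode("utf-8")

# Decoding per the declared encoding (what happens when the raw bytes from
# req.content are handed to the parser) preserves the accents:
print(html_bytes.decode("utf-8"))       # Café crème

# Decoding as ISO-8859-1 (requests' historical fallback when the headers
# give no charset, i.e. what req.text could return) garbles them:
print(html_bytes.decode("iso-8859-1"))  # CafÃ© crÃ¨me
```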
import requests
from bs4 import BeautifulSoup
import csv

url = "https://books.toscrape.com/"
req_main = requests.get(url)
soup_main = BeautifulSoup(req_main.content, "html.parser")

with open("output.csv", "w", newline="", encoding="utf-8") as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=["Title", "Price", "Description"])
    csv_output.writeheader()

    for article in soup_main.find_all("article", class_="product_pod"):
        req_book = requests.get(url + article.a['href'])
        soup_book = BeautifulSoup(req_book.content, "html.parser")

        row = {
            "Title" : soup_book.h1.text,
            "Price" : soup_book.find("p", class_="price_color").text,
            "Description" : soup_book.find("div", id="product_description").find_next("p").text,
        }
        csv_output.writerow(row)
The output as shown in Notepad++ with utf-8 encoding:

Notepad++ is able to correctly display the utf-8 encoded characters. If the displayed encoding is changed to ANSI you get:

You need to ensure your editor can correctly display utf-8 encoded text.
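If the file is destined for Excel rather than a text editor, one option (an aside, not part of the demo above) is to write a UTF-8 byte order mark so Excel auto-detects the encoding; Python's utf-8-sig codec does this. A sketch with hypothetical row data:

```python
import csv

# encoding="utf-8-sig" prepends a BOM (EF BB BF) to the file, which Excel
# uses to recognise UTF-8; plain text editors simply ignore it.
with open("output_excel.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "Price"])
    writer.writeheader()
    writer.writerow({"Title": "Café au lait", "Price": "£5.99"})  # hypothetical row

with open("output_excel.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf' - the BOM
```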
Upvotes: 2