Omiod

Reputation: 11653

How to handle utf-8 text with Python 3?

I need to parse various text sources and then print them or store them somewhere.

Every time a non-ASCII character is encountered, I can't print it correctly: the text shows up as a bytes literal, and I have no idea how to see the actual characters.

(I'm quite new to Python, I come from PHP where I never had any utf-8 issues)

The following is a code example:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import feedparser

url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')

print(title)

file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()

I'd like to print the RSS title (BBC Japanese - ホーム) and write it to a file, but instead the result is this:

b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'

This happens both on screen and in the file. Is there a proper way to do this?

Upvotes: 5

Views: 48229

Answers (3)

Tarkeshwar Prasad

Reputation: 21

Writing JSON data to a file with Unicode support for Japanese characters:

import json

def jsonFileCreation(messageData, fileName):
    # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
    with open(fileName, "w", encoding="utf-8") as outfile:
        json.dump(messageData, outfile, indent=8, sort_keys=False, ensure_ascii=False)
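To illustrate what ensure_ascii changes, here is a small sketch that serializes to a string with json.dumps instead of a file. The sample dictionary is made up for the illustration and is not taken from the feed.

```python
import json

# Hypothetical sample data, used only for this illustration.
data = {"title": "ホーム"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters themselves in the output.
readable = json.dumps(data, ensure_ascii=False)

# Both forms decode back to the same Python object.
same = json.loads(escaped) == json.loads(readable) == data
```

Both outputs are valid JSON; the second is simply easier to read in a text editor, provided the file is written with a UTF-8 encoding.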

Upvotes: 2

qjx

Reputation: 11

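In Python 3, print(A) encodes the string A with the encoding of the output stream before writing it; on some consoles that encoding is 'gbk' rather than utf-8. If A contains characters that the console encoding cannot represent, printing can fail. To print A anyway, you can first drop the characters gbk cannot represent by round-tripping A through gbk as follows: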

print(A.encode('gbk','ignore').decode('gbk'))
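A more general sketch of the same idea, assuming you would rather look up the console encoding than hard-code 'gbk'. The helper name printable is made up for this illustration; sys.stdout.encoding and the errors="replace" argument are standard Python.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'.

    Hypothetical helper for illustration; falls back to the stdout encoding.
    """
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)

# Show which encoding print() will use on this machine.
print(sys.stdout.encoding)

# Forcing ASCII here just to demonstrate the replacement behaviour.
print(printable("BBC Japanese - ホーム", "ascii"))
```

Unlike 'ignore', 'replace' leaves a visible '?' where a character was lost, which makes the data loss easier to notice.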

Upvotes: 1

Dean Fenster

Reputation: 2395

In Python 3, bytes and str are two different types. str represents any text (including Unicode); when you encode() it, you convert it from its str representation to its bytes representation in a specific encoding. Calling str() on a bytes object does not decode it; it returns the b'...' representation, which is what ended up on your screen and in your file.
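A minimal sketch of that distinction, using the title from the question as a literal instead of fetching the feed:

```python
title = "BBC Japanese - ホーム"        # str: text, made of Unicode characters

encoded = title.encode("utf-8")        # bytes: the UTF-8 representation of that text
as_text = str(encoded)                 # str() of bytes gives the b'...' repr, not the text
decoded = encoded.decode("utf-8")      # decode() turns the bytes back into the original str

print(type(title).__name__, type(encoded).__name__)
print(as_text)
print(decoded == title)
```

The second print shows the same b'...' output as in the question, which is why wrapping the encoded title in str() does not help.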

In your case, to get the decoded strings, you just need to remove the encode('utf-8') part:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import feedparser

url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title')

print(title)

file = codecs.open("test.txt", "w", encoding="utf-8")
file.write(title)
file.close()
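As a side note, the built-in open() in Python 3 also accepts an encoding argument, so codecs is optional here. A sketch under the assumption that the title is already a str; it writes to a temporary directory rather than test.txt so it can be run anywhere:

```python
import os
import tempfile

title = "BBC Japanese - ホーム"
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(title)

# Reading in binary mode shows the raw UTF-8 bytes that were stored.
with open(path, "rb") as f:
    raw = f.read()

# Reading in text mode with the same encoding gives the original str back.
with open(path, encoding="utf-8") as f:
    text = f.read()
```

Using a with block also closes the file automatically, so the explicit file.close() is not needed.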

Upvotes: 10
