Reputation: 629
I am trying to do some Python text-parsing programming with Hebrew (Unicode) text from the Torah.
Here is a link to the example text (Genesis) that I am using from Sefaria.org: https://github.com/Sefaria/Sefaria-Export/blob/master/json/Tanakh/Torah/Genesis/Hebrew/Tanach%20with%20Text%20Only.json
I am able to successfully import the JSON data.
I run the usual data-extraction tests and print the results with print() to examine the data.
In the code below, I notice that only the output for KEYS stays on the screen/terminal/console. All the other data (VALUES, ITEMS, and the VALUE for the dictionary key 'text') disappears from the screen (please run the code with the data and see for yourself).
I figure this is some sort of encoding or decoding issue, because everything that disappears contains Hebrew text (i.e. VALUES, ITEMS, and the VALUE for the dictionary key 'text'), so I did a standard sys check, which printed the following output:
sys.stdin.encoding = cp1252
sys.stdout.encoding = cp1252
I figure that I may need to define/encode/decode or do something to allow written output of UTF-8 UNICODE characters (Hebrew) to the Python terminal.
Any ideas how to solve this issue?
## IMPORT NECESSARY MODULES
import json
import sys
## CHECK ENCODING AND PRINT/TEST OUTPUT
print("sys.stdin.encoding = ", sys.stdin.encoding)
print("sys.stdout.encoding = ", sys.stdout.encoding)
## READ JSON FILE & IMPORT DATA - UTF8 CODING TO READ HEBREW TEXT
json_data = open('DATA_1GENESIS.json', encoding="utf8").read()
## LOADS AND TRANSFORMS JSON DATA TO PYTHON DICTIONARY OBJECT
DictionaryData = json.loads(json_data)
print('\n')
print("IMPORTED JSON DATA TYPE = ", type(DictionaryData))
## LOOP THROUGH DATA AND PRINT
for item in DictionaryData:
    print("ITEM = ", item, type(item), len(item))
## TEST OUTPUT
print('\n')
print("IMPORTED DICTIONARY DATA = ",DictionaryData, type(DictionaryData),len(DictionaryData))
## EXTRACT DICTIONARY KEYS - 'dict_keys' object
k = DictionaryData.keys()
print('\n')
print("KEYS = ",k,type(k),len(k))
## EXTRACT DICTIONARY VALUES - 'dict_values' object
v = DictionaryData.values()
print('\n')
print("VALUES = ",v,type(v),len(v))
## EXTRACT DICTIONARY ITEMS - 'dict_items' object
i = DictionaryData.items()
print('\n')
print("ITEMS = ",i,type(i),len(i))
## EXTRACT VALUE FOR KEY 'text' = DictionaryData['text']
text = DictionaryData['text']
print('\n')
print("TEXT = ", text, type(text), len(text))
EDIT
I just ran a simple test that prints a single line of Unicode Hebrew. Here is the code; it printed the output to the Python screen/terminal/console perfectly. So the question remains: why would the values extracted from the dictionary above disappear after being printed to the screen (please try the code with the data to see for yourselves!)?
x = "בראשית ברא אלהים את השמים ואת הארץ"
print("x = ",x)
Upvotes: 1
Views: 824
Reputation: 16224
That is probably not because of your encoding, since Python 3 strings are Unicode and it uses UTF-8 as the default source encoding.
A more probable issue is that your console uses the font Consolas, which has no Hebrew support.
Change to a font like Courier New to show the Hebrew characters in the console.
On Windows, simply click the icon at the top of the window (it should be at the top left, or at the top right if your Windows is in Hebrew).
Then click Properties (הגדרות) and choose the font you want (I recommend Courier New).
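To check whether the console encoding plays any part, one option is to test whether the text can be encoded with the encoding that sys.stdout.encoding reports. The following is only a sketch under assumptions, not a diagnosis of the problem above: can_encode and use_utf8_stdout are hypothetical helper names, and sys.stdout.reconfigure is only available on Python 3.7+ when stdout is a regular text stream (some IDEs replace it with their own object).

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def use_utf8_stdout():
    """Switch stdout to UTF-8 if the stream supports it (assumes Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        # errors="replace" substitutes unencodable characters instead of raising.
        sys.stdout.reconfigure(encoding="utf-8", errors="replace")
        return True
    return False


# Two Hebrew letters joined by a maqaf (U+05BE), written as escapes.
sample = "\u05e2\u05dc\u05be\u05e4\u05e0\u05d9"

# cp1252 is the encoding reported in the question; it contains no Hebrew characters.
print("cp1252 can encode sample:", can_encode(sample, "cp1252"))
print("utf-8 can encode sample:", can_encode(sample, "utf-8"))
```

Calling use_utf8_stdout() at the top of the script, before any print(), would be one way to try UTF-8 output; whether the characters then display correctly still depends on the console and its font.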
The problem seems to be the use of the character \u05be (מקף, the maqaf) in the text. I tried the following when loading the file and it worked as it should:
json_data = open('DATA_1GENESIS.json', encoding="utf8").read().replace('\u05be', '')
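Note that deleting the maqaf joins the two words it connected. If word boundaries matter for later parsing, a variation is to replace it with a space instead. This is a minimal sketch under that assumption: replace_maqaf is a hypothetical helper name, and the inline JSON string stands in for the contents of the real file.

```python
import json

MAQAF = "\u05be"  # Hebrew punctuation maqaf, a hyphen-like word joiner


def replace_maqaf(raw, replacement=" "):
    """Replace every maqaf in the raw string with the given replacement."""
    return raw.replace(MAQAF, replacement)


# Stand-in for the file contents: two Hebrew words joined by a maqaf.
# The real script would read DATA_1GENESIS.json with encoding="utf8" instead.
raw_json = '{"text": ["\u05e2\u05dc\u05be\u05e4\u05e0\u05d9"]}'

data = json.loads(replace_maqaf(raw_json))
words = data["text"][0].split()
print(len(words))  # 2
```

Passing replacement="-" would keep a visible ASCII hyphen between the words rather than a space.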
Upvotes: 3