Reputation: 137
Using the Python sqlite3 package I'm selecting some rows from an sqlite database (with a triple inner join) in order to put them in a pandas dataframe at a later stage.
I've tried running the following code to select from the database:
import pandas as pd
import sqlite3
# connect to the database
conn = sqlite3.connect("C:\Quant\Data_Storage2\sqlite\db\scorecard.db")
conn.text_factory = str
c = conn.cursor()
# create pandas df with data from the sqlite db
sql = '''SELECT MACRO_SA.SERIES_ID, MACRO_SA.COUNTRY_ID, MACRO_SA.DATA,
MACRO_SA.REL_DATE, HF_DATA.DATA
FROM MACRO_SA
INNER JOIN MAP_SERIES ON MAP_SERIES.CUR = MACRO_SA.COUNTRY_ID
INNER JOIN HF_DATA ON MAP_SERIES.CUR = MACRO_SA.COUNTRY_ID
WHERE (MAP_SERIES.CRNCY NOT NULL)
AND (HF_DATA.SERIES_ID = 'bg_cur')
'''
df_to_adj = c.execute(sql).fetchall()
conn.commit()
df_to_adj = pd.DataFrame(df_to_adj, columns=['SERIES_ID', 'COUNTRY_ID', 'DATA', 'REL_DATE', 'FX_RATE'])
I get the following error: "sqlite3.OperationalError: Could not decode to UTF-8". Any help is much appreciated.
Upvotes: 0
Views: 3284
Reputation: 86
I had a similar problem. The text in the database seemed to be encoded in another encoding than utf-8. To find out the which encodings worked instead I used a helper function to try out all encodings:
from encodings.aliases import aliases
_encodings = set(aliases.values())
def try_encodings(byte_text: bytes):
for encoding in _encodings:
try:
print(f'Encoding {encoding}: {byte_text.decode(encoding)}')
except (UnicodeDecodeError, LookupError):
pass
Set text_factory
to bytes
and iterate row per row and decode the problem column. If the row with the decoding problem is found, try out all other encodings:
c = conn.cursor()
conn.text_factory = bytes
sql = 'Select * from ...'
c.execute(sql)
col_ix = 0 # Set column index of problem column
while row := c.fetchone():
try:
text = row[col_ix].decode()
except UnicodeDecodeError as e:
print('Cannot decode byte string: ', row[col_ix])
try_encodings(row[col_ix])
break
If you found the correct encoding, you can set it to text_factory. For example cp852:
conn.text_factory = lambda x: x.decode('cp852')
Upvotes: 2
Reputation: 11
conn.text_factory = str
works with Python 2; if you're using Python 3, you have to provide your own magic function which takes in bytes, and spits out whatever you want the string to be. One option is trying to decode the string, and if it fails, return the bytes raw. If you can always handle bytes, just doing
conn.text_factory = bytes
is enough. See https://docs.python.org/3.7/library/sqlite3.html for the gory details.
Upvotes: 0