Reputation: 1759
I'm having a lot of trouble using python's sqlite3 library with UTF-8 strings. I need this encoding because I am working people's names, in my database.
My SQL schema for the desired table is:
CREATE TABLE senators (id integer, name char);
I would like to do the following in Python (ignore the very ugly way I wrote the select statement. I did it this way for debugging purposes):
statement = u"select * from senators where name like '" + '%'+row[0]+'%'+"'"
c.execute(statement)
row[0] is the name of each row in a file that has this type of entry:
Dário Berger,1
Edison Lobão,1
Eduardo Braga,1
While I have a non empty result for names like Eduardo Braga, any time my string has UTF-8 characters, I get a null result.
I have checked that my file has in fact been saved with UTF-8 encoding (Microsoft Notepad). On a Apple mac, in the terminal, I used the PRAGMA command in the sqlite3 shell to check the encoding:
sqlite> PRAGMA encoding;
UTF-8
Does anybody have an idea what I can do here?
EDIT - Complete example: Python script that creates the databases, and populates with initial data from senators.csv (file):
# -*- coding: utf-8 -*-
import sqlite3
import csv
conn = sqlite3.connect('senators.db')
c = conn.cursor()
c.execute('''CREATE TABLE senators (id integer, name char)''')
c.execute('''CREATE TABLE polls (id integer, senator char, vote integer, FOREIGN KEY(senator) REFERENCES senators(name))''')
with open('senators.csv', encoding='utf-8') as f:
f_csv = csv.reader(f)
for row in f_csv:
c.execute(u"INSERT INTO senators VALUES(?,?)", (row[1], row[0]))
conn.commit()
conn.close()
Script that populates the polls table, using Q1.txt (file).
import csv
import sqlite3
import re
import glob
conn = sqlite3.connect('senators.db')
c = conn.cursor()
POLLS = {
'senator': 'votes/senator/Q*.txt',
'deputee': 'votes/deputee/Q*.txt',
}
s_polls = glob.glob(POLLS['senator'])
d_polls = glob.glob(POLLS['deputee'])
for poll in s_polls:
m = re.match('.*Q(\d+)\.txt', poll)
poll_id = m.groups(0)
with open(poll, encoding='utf-8') as p:
f_csv = csv.reader(p)
for row in f_csv:
c.execute(u'SELECT id FROM senators WHERE name LIKE ?', ('%'+row[0]+'%',))
data = c.fetchone()
print(data) # I should not get None results here, but I do, exactly when the query has UTF-8 characters.
Note the file paths, if you want to test these scripts out.
Upvotes: 3
Views: 5093
Reputation: 1759
Ok guys,
After a lot of trouble, I found out that the problem was that the encodings, all though were both considered UTF-8, were still different anyways. The difference was that while the database was decomposed UTF-8 (ã = a + ~), my input was in precomposed form (one code for the ã character).
To fix it, I had to convert all my input data to the decomposed form.
from unicodedata import normalize
with open(poll, encoding='utf-8') as p:
f_csv = csv.reader(p)
for row in f_csv:
name = normalize("NFD",row[0])
c.execute(u'SELECT id FROM senators WHERE name LIKE ?', ('%'+name+'%',))
See this article, for some excellent information on the subject.
Upvotes: 3
Reputation: 26569
From the SQLite docs:
Important Note: SQLite only understands upper/lower case for ASCII characters by default. The LIKE operator is case sensitive by default for unicode characters that are beyond the ASCII range. For example, the expression
'a' LIKE 'A'
is TRUE but'æ' LIKE 'Æ'
is FALSE.
Also, use query parameters. Your query is vulnerable to SQL injection.
Upvotes: 0