Reputation: 27
I have a set of python scripts (https://github.com/hvdwolf/wikiscripts) that parse wikidumps to make it into gpx/osm/csv/sql/sqlite dumps to be used as POI files in Nav apps. I only parse the articles that have coordinates. For this I use the externallinks dumps that contain sql insert statements. The sql statements containing the "geohack.php" substring do contain the coordinates. I import these in an sqlite database to be used as reference for the article dumps. They are all utf-8 dumps and parsing all "western type" files works fine, but the languages like arabic, farsi, russian, japanese, greek, chinese and the others don't work. Obviously I'm doing something wrong.
The strings I get for the titles are:
%D9%85%D8%A7%D9%81%D8%B8%D8%A9_%D8%A7%D9%84%D8%A8%D8%AF%D8%A7%D8%A6%D8%B9 %D8%A3%D9%88%D8%B1%D9%8A%D9%88%D9%8A%D9%84%D8%A7 Battle_of_Nicopolis
Qingdao
So some normal characters are OK. The rest is gibberish (to me). I already did some test where I simply read the dump and write to utf-8 encoded text file (line in => line out) and then it works fine, but somewhere in the string handling functions and "re." functions it changes my unicode text.
Edit: My python script starts with: # -- coding: utf-8 --
My code (the relevant part, including python2 and python3 statements, and some remarks to display what I already tried):
with gzip.open(externallinks_file, 'r') as single_externallinksfile:
#reader = codecs.getreader("utf-8")
#single_externallinksfile = reader(single_externallinksfile)
#with codecs.getreader('utf-8')gzip.open(externallinks_file, 'r') as single_externallinksfile:
linecounter = 0
totlinecounter = 0
filelinecounter = 0
# We need to read line by line as we have massive files, sometimes multiple GBs
for line in single_externallinksfile:
if sys.version_info<(3,0,0):
line = unicode(line, 'utf-8')
else:
line = line.decode("utf-8")
if "INSERT INTO" in line:
insert_statements = line.split("),(")
for statement in insert_statements:
#statement = statement.decode("utf-8")
filelinecounter += 1
#if ("geohack.php?" in statement) and (("pagename" in statement) or ("src=" in statement)):
# src can also be in the line, but is different and we leave it out for now
if ("geohack.php?" in statement) and ("pagename" in statement) and ("params" in statement):
language = ""
region = ""
poitype = ""
content = re.findall(r'.*?pagename=(.*?)\'\,\'',statement,flags=re.IGNORECASE)
if len(content) > 0: # We even need this check due to corrupted lines
splitcontent = content[0].split("&")
title = splitcontent[0]
#title = title.decode('utf8')
for subcontent in splitcontent:
if "language=" in subcontent:
language = subcontent.replace("language=","")
#print('taal is: ' + language)
if "params=" in subcontent:
params_string = subcontent.replace("params=","").split("_")
latitude,longitude,poitype,region = get_coordinates_type_region(params_string)
if ( str(latitude) != "" and str(longitude) != "" and (str(latitude) != "0") or (str(longitude) != "0")):
if GENERATE_SQL == "YES":
sql_file.write('insert into ' + file_prefix + '_externallinks values ("' + title + '","' + str(latitude) + '","' + str(longitude) + '","' + language + '","' + poitype + '","' + region + '");\n')
if CREATE_SQLITE == "YES":
sqlcommand = 'insert into ' + file_prefix + '_externallinks values ("' + title + '","' + str(latitude) + '","' + str(longitude) + '","' + language + '","' + poitype + '","' + region +'");'
#print(sqlcommand)
cursor.execute(sqlcommand)
linecounter += 1
if linecounter == 10000:
if CREATE_SQLITE == "YES":
# Do a databse commit every 10000 rows
wikidb.commit()
totlinecounter += linecounter
linecounter = 0
print('\nProcessed ' + str(totlinecounter) + ' lines out of ' + str(filelinecounter) + ' sql line statements. Elapsed time: ' + str(datetime.datetime.now().replace(microsecond=0) - start_time))
Upvotes: 1
Views: 163
Reputation: 879103
It looks like the titles are percent-encoded.
try:
# Python 3
from urllib.parse import unquote
except ImportError:
# Python 2
from urllib import unquote
percent_encoded = '''
%D9%85%D8%A7%D9%81%D8%B8%D8%A9_%D8%A7%D9%84%D8%A8%D8%AF%D8%A7%D8%A6%D8%B9
%D8%A3%D9%88%D8%B1%D9%8A%D9%88%D9%8A%D9%84%D8%A7
Battle_of_Nicopolis
Qingdao
'''
print(unquote(percent_encoded))
yields
مافظة_البدائع
أوريويلا
Battle_of_Nicopolis
Qingdao
Upvotes: 1