Reputation: 1105
I have several GB of data encoded in various XML files. For some reason, the (closed-source) program generating these XML files encodes the text in a URL-like (percent-encoded) representation, e.g. '08.06.2016 22:41:35'
becomes 08%2E06%2E2016%2022%3A41%3A35
The data I am interested in mostly contains spaces, (decimal) dots and colons, but I need the code to handle any percent-encoded byte.
For now, I am using urllib.parse.unquote
. It is, however, very slow: profiling shows that 90% of the time spent by my data-mining algorithm is due to urllib.parse.unquote
. You can see below how it compares with str.replace.
from urllib.parse import unquote
from time import clock

t0 = clock()
for i in range(10000):
    unquote('08%2E06%2E2016')
t1 = clock()

t2 = clock()
for i in range(10000):
    '08%2E06%2E2016'.replace('%2E', '\x2E')
t3 = clock()

print('unquote time: ', t1 - t0, '\nreplace time: ', t3 - t2)
unquote time: 0.12173581222984353
replace time: 0.009713842143412421
I could try to chain a replace call for every escape sequence I know, but I'm afraid of missing one.
I have tried to use re.sub
or similar but I was unsuccessful: it is not so trivial to replace '%XX' by the corresponding character.
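For context, here is the kind of re.sub approach I mean: passing a replacement function instead of a replacement string, so each %XX escape is decoded individually. A minimal sketch (not benchmarked; the helper name is made up, and it only handles single-byte ASCII escapes, not multi-byte UTF-8 sequences):

```python
import re

def percent_decode(s):
    # Replace each %XX escape with the character whose code is
    # the two hex digits; the lambda runs once per match.
    return re.sub('%([0-9A-Fa-f]{2})',
                  lambda m: chr(int(m.group(1), 16)),
                  s)

print(percent_decode('08%2E06%2E2016%2022%3A41%3A35'))  # 08.06.2016 22:41:35
```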
Any idea?
I'm using Python 3.5
Upvotes: 0
Views: 133
Reputation: 465
I don't think it can be done much faster in pure Python, but unquote_to_bytes
gives about a 2x speedup on my machine:
from urllib.parse import unquote_to_bytes
unquote_to_bytes('08%2E06%2E2016').decode()
Upvotes: 1