hyamanieu

Reputation: 1105

Decode url-like strings much quicker

I have several GB of data encoded in different XML files. For some reason, the (closed-source) program that generates these XML files encodes the text in a URL-like representation, e.g. '08.06.2016 22:41:35' becomes 08%2E06%2E2016%2022%3A41%3A35

The data I am interested in mostly contains spaces, (decimal) dots and colons, but I need the code to handle any kind of hex representation.

For now, I am using urllib.parse.unquote. It is very slow, however: a profiler shows that 90% of the time spent by my data mining algorithm goes to urllib.parse.unquote. The comparison below shows how it fares against replace.

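from urllib.parse import unquote
from time import perf_counter  # time.clock is deprecated and was removed in Python 3.8

# Time 10000 calls to unquote
t0 = perf_counter()
for i in range(10000):
    unquote('08%2E06%2E2016')
t1 = perf_counter()

# Time 10000 calls to str.replace for a single known escape
t2 = perf_counter()
for i in range(10000):
    '08%2E06%2E2016'.replace('%2E', '\x2E')
t3 = perf_counter()

print('unquote time: ', t1 - t0, '\nreplace time: ', t3 - t2)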

unquote time: 0.12173581222984353

replace time: 0.009713842143412421

I could chain a replace call for every hex code I know of, but I'm afraid of missing something. I have tried re.sub and similar, but without success: it is not so trivial to replace '%' by '\x'.
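
For reference, a re.sub-style version that handles any hex code could look roughly like the sketch below. It passes a function as the replacement instead of a '\x' string, and works on bytes so that multi-byte sequences survive. The helper name decode_percent is made up for illustration, and the sketch assumes the escapes encode UTF-8 text; whether it actually beats unquote would need to be measured.

```python
import re

# Matches one percent-escape such as %2E (a '%' followed by two hex digits).
_ESCAPE = re.compile(rb'%([0-9A-Fa-f]{2})')

def decode_percent(text):
    """Hypothetical helper: decode %XX escapes, assuming a UTF-8 payload."""
    raw = text.encode('utf-8')
    # The replacement is a function: each match is turned into the single
    # byte whose value is given by the two hex digits.
    decoded = _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return decoded.decode('utf-8')

print(decode_percent('08%2E06%2E2016%2022%3A41%3A35'))
# prints: 08.06.2016 22:41:35
```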

Any idea?

I'm using Python 3.5

Upvotes: 0

Views: 133

Answers (1)

eigil

Reputation: 465

I don't think it can be done much quicker in pure Python, but unquote_to_bytes gives about 2x speedup on my machine:

from urllib.parse import unquote_to_bytes
unquote_to_bytes('08%2E06%2E2016').decode()
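
To check the difference on your own machine and data, a timeit comparison along these lines could be used. This is only a sketch: sample, via_unquote and via_bytes are names made up here, and the sample string is the one from the question. Note that unquote replaces invalid UTF-8 sequences by default, while a plain .decode() raises on them, so the two only agree on well-formed input.

```python
from timeit import timeit
from urllib.parse import unquote, unquote_to_bytes

# Sample value taken from the question; swap in real data for a fair test.
sample = '08%2E06%2E2016%2022%3A41%3A35'

def via_unquote():
    return unquote(sample)

def via_bytes():
    # Decode to bytes first, then to str (UTF-8 is the default for decode()).
    return unquote_to_bytes(sample).decode()

# Both approaches should give the same text on well-formed input.
assert via_unquote() == via_bytes()

n = 10000
t_unquote = timeit(via_unquote, number=n)
t_bytes = timeit(via_bytes, number=n)
print('unquote:         ', t_unquote)
print('unquote_to_bytes:', t_bytes)
```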

Upvotes: 1
