YoYoYo

Reputation: 439

Decode Httrack-encoded URLs in Python?

I have downloaded a full website using Httrack Website Copier, and now I want to retrieve all image source ('src') URLs using Python 3.7.

I have already extracted them, but for further use I need those URLs in plain text. Instead, they look like this:

cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg
cid:httpsX3aX2fX2fuploadX2ewikimediaX2eorgX2fwikipediaX2fcommonsX2fthumbX2f6X2f6eX2fBursitisX5fElbowX5fWCX2eJPGX2f800pxX2dBursitisX5fElbowX5fWCX2eJPGX3f20070925222131
cid:httpsX3aX2fX2fuploadX2ewikimediaX2eorgX2fwikipediaX2fcommonsX2fthumbX2f6X2f62X2fPDX2diconX2esvgX2f64pxX2dPDX2diconX2esvgX2epng
cid:httpsX3aX2fX2fuploadX2ewikimediaX2eorgX2fwikipediaX2fcommonsX2fthumbX2f6X2f6eX2fBursitisX5fElbowX5fWCX2eJPGX2f120pxX2dBursitisX5fElbowX5fWCX2eJPGX3f20070925222131

I don't know what these cid: URLs are. Google sent me to the question "Replace (cid:<number>) with chars using Python when extracting text from PDF files", which may be related to this problem, but it doesn't help me (or at least I don't see how). I also thought they might be escaped URLs, so I tried the solution from "Decode escaped characters in URL", which doesn't work either.

This is my attempt to solve the problem in Python:

import urllib.parse

mystr = "cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg"

print(mystr.encode('utf-8'))
print(mystr.encode('utf-8').decode('utf-8', errors='ignore'))
print(mystr.encode('utf-8').decode('utf-8'))

print(urllib.parse.unquote(mystr, encoding='utf-8', errors='replace'))

Note that I also tried encoding and decoding them as UTF-8, because I thought that might help, but it doesn't change anything either.

Result of my code above is:

b'cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg'
cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg
cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg
cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg
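
A likely reason the last call changes nothing: urllib.parse.unquote only rewrites percent-escapes of the form %XX, and these strings contain no % character at all. A small check, where percent_encoded is a hand-written sample for illustration (an assumption, not Httrack output):

```python
import urllib.parse

# Hand-made percent-encoded sample (an assumption for illustration, not Httrack output).
percent_encoded = "https%3a%2f%2fcommons%2em%2ewikimedia%2eorg"
# The same text in the X-prefixed form seen in the downloaded pages.
x_encoded = "httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorg"

decoded_percent = urllib.parse.unquote(percent_encoded)
decoded_x = urllib.parse.unquote(x_encoded)

print(decoded_percent)  # the %XX escapes are decoded
print(decoded_x)        # unchanged: there is no '%' for unquote to act on
```

The encode/decode round trip through UTF-8 cannot help for the same kind of reason: it converts between str and bytes but never interprets escape sequences.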

You can replace mystr with any of the strings from the beginning of my post and the output will still be unchanged.
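
One pattern is visible in the samples themselves: each X followed by two hexadecimal digits appears to stand for the character with that ASCII code (X3a would be ':', X2f '/', X2e '.', X2d '-', X5f '_', X3f '?'). Below is a minimal sketch under that assumption. The name decode_cid_url is made up for illustration, and the scheme is inferred from the samples rather than taken from any Httrack documentation, so a URL that really contains an X followed by two hex digits could be decoded incorrectly:

```python
import re

# Assumed escape form, inferred from the samples: 'X' followed by two hex digits.
_X_ESCAPE = re.compile(r"X([0-9a-fA-F]{2})")


def decode_cid_url(value):
    """Strip a leading 'cid:' and turn each assumed XHH escape into its character."""
    if value.startswith("cid:"):
        value = value[len("cid:"):]
    # Replace every XHH with the character whose code point is 0xHH.
    return _X_ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), value)


samples = [
    "cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg",
    "cid:httpsX3aX2fX2fuploadX2ewikimediaX2eorgX2fwikipediaX2fcommonsX2fthumbX2f6X2f6eX2fBursitisX5fElbowX5fWCX2eJPGX2f800pxX2dBursitisX5fElbowX5fWCX2eJPGX3f20070925222131",
]

for sample in samples:
    print(decode_cid_url(sample))
```

If the assumption holds, the first sample should come out as an ordinary https URL on commons.m.wikimedia.org. It is worth comparing a few decoded results against the original pages before relying on this for the whole site.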

Thanks in advance!

Upvotes: 0

Views: 70

Answers (0)
