Reputation: 439
I have downloaded a full website using HTTrack Website Copier, and now I want to retrieve all image source ('src') URLs using Python 3.7.
I have already extracted them, but for further use I need the URLs as plain text. Instead, they look like this:
cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg
cid:httpsX3aX2fX2fuploadX2ewikimediaX2eorgX2fwikipediaX2fcommonsX2fthumbX2f6X2f6eX2fBursitisX5fElbowX5fWCX2eJPGX2f800pxX2dBursitisX5fElbowX5fWCX2eJPGX3f20070925222131
cid:httpsX3aX2fX2fuploadX2ewikimediaX2eorgX2fwikipediaX2fcommonsX2fthumbX2f6X2f62X2fPDX2diconX2esvgX2f64pxX2dPDX2diconX2esvgX2epng
cid:httpsX3aX2fX2fuploadX2ewikimediaX2eorgX2fwikipediaX2fcommonsX2fthumbX2f6X2f6eX2fBursitisX5fElbowX5fWCX2eJPGX2f120pxX2dBursitisX5fElbowX5fWCX2eJPGX3f20070925222131
I don't know what these cid: URLs are. Google led me to "Replace (cid:<number>) with chars using Python when extracting text from PDF files", which may be related to this problem, but it doesn't help me (or at least I don't see how). I also thought they might be escaped URLs, so I tried the solution from "Decode escaped characters in URL", which doesn't work either.
This is my attempt to solve the problem in Python:
import urllib.parse
mystr = "cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg"
print(mystr.encode('utf-8'))
print(mystr.encode('utf-8').decode('utf-8', errors='ignore'))
print(mystr.encode('utf-8').decode('utf-8'))
print(urllib.parse.unquote(mystr, encoding='utf-8', errors='replace'))
Note that I also tried encoding the string to UTF-8 and decoding it back, because I thought that might help, but it doesn't work either.
The output of the code above is:
b'cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg'
cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg
cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg
cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg
You can replace mystr with any of the strings from the beginning of my post and the result is the same: the string comes back unchanged.
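One observation that may explain the output above: urllib.parse.unquote only decodes %XX sequences, and these strings contain none, so it returns them untouched. Looking at the samples, each "X" followed by two hex digits seems to stand where a percent escape would (X3a for ":", X2f for "/", X2e for "."). The sketch below rests on that assumption, which I have not confirmed against any HTTrack documentation; decode_cid is a hypothetical helper name. It would also wrongly decode a URL that genuinely contains a literal "X" followed by two hex digits.

```python
import re
from urllib.parse import unquote

# Assumption: an "X" followed by two hex digits is an escaped byte,
# i.e. "X2f" plays the role that "%2f" plays in a percent-encoded URL.
_ESCAPE = re.compile(r"X([0-9a-fA-F]{2})")


def decode_cid(src: str) -> str:
    """Hypothetical helper: turn a 'cid:...XHH...' string into a plain URL."""
    # Drop the "cid:" prefix if it is present.
    body = src[4:] if src.startswith("cid:") else src
    # Rewrite each assumed "XHH" escape as "%HH", then let unquote decode it.
    return unquote(_ESCAPE.sub(r"%\1", body))


mystr = "cid:httpsX3aX2fX2fcommonsX2emX2ewikimediaX2eorgX2fstaticX2fimagesX2fmobileX2fcopyrightX2fcommonswikiX2dwordmarkX2esvg"
print(decode_cid(mystr))
# → https://commons.m.wikimedia.org/static/images/mobile/copyright/commonswiki-wordmark.svg
```

Under the same assumption, the second string from the top of the post would come out as https://upload.wikimedia.org/wikipedia/commons/thumb/6/6e/Bursitis_Elbow_WC.JPG/800px-Bursitis_Elbow_WC.JPG?20070925222131.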
Thanks in advance!
Upvotes: 0
Views: 70