Reputation: 4436
I am trying to download a file and write it to disk, but somehow I am lost in encoding/decoding land.
from urllib.request import urlopen
url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    data = response.read()
filename = 'test.txt'
file_ = open(filename, 'wb')
file_.write(data)
file_.close()
Here data is a byte string. If I check the file I find a bunch of strange characters. I tried
import chardet
the_encoding = chardet.detect(data)['encoding']
but this results in None, so I don't really know how the data I downloaded is encoded.
If I just type "http://export.arxiv.org/e-print/supr-con/9608001" into the browser, it downloads a file that I can view with a text editor and it's a perfectly fine .tex file.
Upvotes: 0
Views: 174
Reputation: 30238
Apply the python-magic library. python-magic is a Python interface to the libmagic file type identification library. libmagic identifies file types by checking their headers according to a predefined list of file types. This functionality is exposed to the command line by the Unix file command.
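To illustrate what header-based identification means, here is a minimal sketch using only the standard library. The SIGNATURES table and the sniff helper are hypothetical names made up for this example and cover just three well-known formats; libmagic's real database is far larger and also inspects more than the first few bytes.

```python
# Hypothetical, tiny signature table; libmagic's database covers far more formats.
SIGNATURES = {
    b"\x1f\x8b": "gzip compressed data",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
}


def sniff(data: bytes) -> str:
    """Return a rough file type guess based on the leading bytes of data."""
    for magic_bytes, description in SIGNATURES.items():
        if data.startswith(magic_bytes):
            return description
    return "unknown"


print(sniff(b"\x1f\x8b\x08\x00rest-of-stream"))  # gzip compressed data
print(sniff(b"plain text"))  # unknown
```

This is only a rough stand-in for what python-magic does; it is handy when installing libmagic is not an option and only a handful of formats matter.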
Commented script (works on Windows 10, Python 3.8.6):
# stage #1: read raw data from a url
from urllib.request import urlopen
import gzip
url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    rawdata = response.read()
# stage #2: detect raw data type by its signature
print("file signature", rawdata[0:2])
import magic
print(magic.from_buffer(rawdata[0:1024]))
# stage #3: decompress raw data and write to a file
data = gzip.decompress(rawdata)
filename = 'test.tex'
with open(filename, 'wb') as file_:
    file_.write(data)
# stage #4: detect encoding of the data ( == encoding of the written file)
import chardet
print(chardet.detect(data))
Result: .\SO\68307124.py
file signature b'\x1f\x8b'
gzip compressed data, was "9608001.tex", last modified: Thu Aug 8 04:57:44 1996, max compression, from Unix
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
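Since urllib.request does not decompress anything on its own, it may be safer to decompress only when the gzip signature is actually present. Below is a minimal sketch of that idea; maybe_gunzip and GZIP_MAGIC are hypothetical names introduced here, and the sample bytes stand in for the downloaded payload so the snippet runs offline.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the two-byte signature printed as "file signature" above


def maybe_gunzip(raw: bytes) -> bytes:
    """Decompress raw if it starts with the gzip signature, else return it unchanged."""
    if raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    return raw


# Offline round trip; the sample stands in for the bytes read from the URL.
sample = b"\\documentclass{article}\n"
restored = maybe_gunzip(gzip.compress(sample))
print(restored == sample)  # True
print(maybe_gunzip(sample) == sample)  # True
```

In the script above, data = maybe_gunzip(rawdata) could replace the unconditional gzip.decompress(rawdata) call, so that a response that is already plain text is written out untouched.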
Upvotes: 1