Reputation: 26715
I'm downloading an entire directory from a web server. It works OK, but I can't figure out how to get the file size before downloading, so I can check whether the file was updated on the server or not. Can this be done the same way as when downloading from an FTP server?
import urllib
import re

url = "http://www.someurl.com"

# Download the page locally
f = urllib.urlopen(url)
html = f.read()
f.close()

f = open("temp.htm", "w")
f.write(html)
f.close()

# List only the .TXT / .ZIP files
fnames = re.findall(r'^.*<a href="(\w+(?:\.txt|\.zip))".*$', html, re.MULTILINE)

for fname in fnames:
    print fname, "..."
    f = urllib.urlopen(url + "/" + fname)

    #### Here I want to check the filesize to decide whether to download or not ####
    file = f.read()
    f.close()

    f = open(fname, "w")
    f.write(file)
    f.close()
@Jon: thanks for your quick answer. It works, but the file size on the web server is slightly smaller than the file size of the downloaded file.

Examples:

Local Size    Server Size
2.223.533     2.115.516
664.603       662.121

Does it have anything to do with CR/LF conversion?
Upvotes: 58
Views: 60199
Reputation: 21
You can use requests to pull this data:

import requests

headers = requests.head(LINK).headers

# X-File-Name is a non-standard header, only present if the server sends it
file_name = headers["X-File-Name"]

# The same headers dict holds other useful info, like the size of the file
file_size = headers["Content-Length"]
Upvotes: 0
Reputation: 540
Quick and reliable one-liner for Python 3 using urllib:

import urllib.request

url = 'https://<your url here>'
size = urllib.request.urlopen(url).info().get('Content-Length', 0)

.get(<dict key>, 0) looks the key up in the dict and returns 0 (or whatever the second argument is) if the key is absent.
Upvotes: 0
Reputation: 5476
Here is a much safer way for Python 3:
import urllib.request
site = urllib.request.urlopen("http://python.org")
meta = site.info()
meta.get('Content-Length')
Returns:
'49829'
meta.get('Content-Length') returns the "Content-Length" header if it exists; otherwise it returns None.
Upvotes: 1
Reputation: 1893
For anyone using Python 3 and looking for a quick solution using the requests package:
import requests
response = requests.head(
"https://website.com/yourfile.mp4", # Example file
allow_redirects=True
)
print(response.headers['Content-Length'])
Note: Not all responses will have a Content-Length, so your application will want to check to see if it exists.

if 'Content-Length' in response.headers:
    ...  # Do your stuff here
Upvotes: 3
Reputation: 587
@PabloG Regarding the local/server file-size difference:

The following is a high-level, illustrative explanation of why it may occur:

The size on disk is sometimes different from the actual size of the data. It depends on the underlying file system and how it operates on data. As you may have seen on Windows, when formatting a flash drive you are asked to provide a 'block/cluster size', which varies between 512 B and 8 kB. When a file is written to disk, it is stored in a 'sort-of linked list' of disk blocks. When a certain block is used to store part of a file, no other file's contents will be stored in the same block, so even if the chunk does not occupy the entire block space, the block is rendered unusable by other files.

Example: when the file system uses 512 B blocks and we need to store a 600 B file, two blocks will be occupied. The first block will be fully utilized, while the second block will have only 88 B utilized, and the remaining (512-88) B will be unusable, resulting in a 'file-size-on-disk' of 1024 B. This is why Windows has different notions of 'file size' and 'size on disk'.

NOTE: There are different pros & cons that come with smaller/bigger FS blocks, so do your research before playing with your file system.
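To see this yourself, here is a minimal sketch (POSIX-only, since st_blocks is not available on Windows; the path is a placeholder) comparing a file's logical size with its allocated size on disk:

import os

path = "some_local_file.zip"  # hypothetical path, replace with a real file

st = os.stat(path)
logical_size = st.st_size            # actual number of bytes in the file
allocated_size = st.st_blocks * 512  # st_blocks is counted in 512-byte units

print("file size:    %d bytes" % logical_size)
print("size on disk: %d bytes" % allocated_size)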
Upvotes: 0
Reputation: 169543
Using the info() method of the object returned by urllib, you can get various information about the retrieved document. Example of grabbing the current Google logo:
>>> import urllib
>>> d = urllib.urlopen("http://www.google.co.uk/logos/olympics08_opening.gif")
>>> print d.info()
Content-Type: image/gif
Last-Modified: Thu, 07 Aug 2008 16:20:19 GMT
Expires: Sun, 17 Jan 2038 19:14:07 GMT
Cache-Control: public
Date: Fri, 08 Aug 2008 13:40:41 GMT
Server: gws
Content-Length: 20172
Connection: Close
It's dict-like, so to get the size of the file, you do urllibobject.info()['Content-Length']
print f.info()['Content-Length']
And to get the size of the local file (for comparison), you can use the os.stat() command:
os.stat("/the/local/file.zip").st_size
Upvotes: 28
Reputation: 6238
For a Python 3 (tested on 3.5) approach I'd recommend:

from urllib.request import urlopen

with urlopen(file_url) as in_file, open(local_file_address, 'wb') as out_file:
    print(in_file.getheader('Content-Length'))
    out_file.write(in_file.read())
Upvotes: 3
Reputation: 31676
A requests-based solution using HEAD instead of GET (also prints HTTP headers):
#!/usr/bin/python
# display size of a remote file without downloading
from __future__ import print_function
import sys
import requests
# number of bytes in a megabyte
MBFACTOR = float(1 << 20)
response = requests.head(sys.argv[1], allow_redirects=True)
print("\n".join([('{:<40}: {}'.format(k, v)) for k, v in response.headers.items()]))
size = response.headers.get('content-length', 0)
print('{:<40}: {:.2f} MB'.format('FILE SIZE', int(size) / MBFACTOR))
$ python filesize-remote-url.py https://httpbin.org/image/jpeg
...
Content-Length                          : 35588
FILE SIZE                               : 0.03 MB
Upvotes: 12
Reputation: 105
In Python 3:
>>> import urllib.request
>>> site = urllib.request.urlopen("http://python.org")
>>> print("FileSize: ", site.length)
Upvotes: 7
Reputation: 1846
I have reproduced what you are seeing:
import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]
f = open("out.txt", "r")
print "File on disk:",len(f.read())
f.close()
f = open("out.txt", "w")
f.write(site.read())
site.close()
f.close()
f = open("out.txt", "r")
print "File on disk after download:",len(f.read())
f.close()
print "os.stat().st_size returns:", os.stat("out.txt").st_size
Outputs this:
opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16861
What am I doing wrong here? Is os.stat().st_size not returning the correct size?
Edit: OK, I figured out what the problem was:
import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]
f = open("out.txt", "rb")
print "File on disk:",len(f.read())
f.close()
f = open("out.txt", "wb")
f.write(site.read())
site.close()
f.close()
f = open("out.txt", "rb")
print "File on disk after download:",len(f.read())
f.close()
print "os.stat().st_size returns:", os.stat("out.txt").st_size
this outputs:
$ python test.py
opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16535
Make sure you are opening both files for binary read/write.
# open for binary write
open(filename, "wb")
# open for binary read
open(filename, "rb")
Upvotes: 41
Reputation: 1846
Also, if the server you are connecting to supports it, look at ETags and the If-Modified-Since and If-None-Match headers.

Using these will take advantage of the web server's caching rules and will return a 304 Not Modified status code if the content hasn't changed.
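As a minimal sketch of the idea (using the requests package rather than urllib; the URL and the stored ETag are hypothetical):

import requests

url = "http://www.someurl.com/file.zip"  # hypothetical URL
stored_etag = '"abc123"'                 # ETag saved from a previous download

response = requests.get(url, headers={"If-None-Match": stored_etag})

if response.status_code == 304:
    print("Not modified, keep the local copy")
else:
    print("Changed, save the new content")
    new_etag = response.headers.get("ETag")  # remember this for next time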
Upvotes: 6
Reputation: 1846
The size of the file is sent as the Content-Length header. Here is how to get it with urllib:
>>> import urllib
>>> site = urllib.urlopen("http://python.org")
>>> meta = site.info()
>>> print meta.getheaders("Content-Length")
['16535']
>>>
Upvotes: 7