pldimitrov

Reputation: 1727

How to crawl a web page for files of certain size

I need to crawl a list of several thousand hosts and find, on each one, at least two files rooted there that are larger than a given value, passed as an argument. Can any popular (Python-based?) tool help with this?

Upvotes: 0

Views: 659

Answers (2)

hejibo

Reputation: 11

Here is how I did it. See the code below.

import urllib2

url = 'http://www.ueseo.org'
r = urllib2.urlopen(url)
# Note: this downloads the entire response body just to measure it
print len(r.read())
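Note that `r.read()` holds the whole body in memory just to count it. A Python 3 sketch of the same idea (measuring the actual transferred size) that streams the response in chunks instead, assuming you really need the downloaded size rather than the advertised one:

```python
from urllib.request import urlopen

def downloaded_size(url, chunk=64 * 1024):
    """Count the bytes of the response body without keeping it in memory."""
    total = 0
    with urlopen(url) as r:
        while True:
            block = r.read(chunk)
            if not block:
                break
            total += len(block)
    return total

# e.g. downloaded_size('http://www.ueseo.org')
```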

Upvotes: 1

Zachary Richey

Reputation: 302

Here is an example of how you can get the file size of a file on an HTTP server.

import urllib2

def sizeofURLResource(url):
    """
    Return the size of the resource at 'url' in bytes,
    as reported by its Content-Length header.
    """
    info = urllib2.urlopen(url).info()
    # Note: the header value comes back as a string, not an int
    return info.getheaders("Content-Length")[0]
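The snippet above is Python 2 (`urllib2` no longer exists in Python 3). A hedged Python 3 equivalent using `urllib.request`, which also converts the header to an int and tolerates servers that omit Content-Length:

```python
from urllib.request import urlopen

def size_of_url_resource(url):
    """Return the size in bytes of the resource at `url`, as reported
    by its Content-Length header, or None if the header is absent."""
    with urlopen(url) as r:
        length = r.info().get("Content-Length")
    return int(length) if length is not None else None

# e.g. size_of_url_resource('http://www.ueseo.org')
```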

There is also a library for building web scrapers here: http://dev.scrapy.org/, but I don't know much about it (just googled it, honestly).
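Tying this back to the original question ("at least two files larger than some value"): assuming you already have a list of candidate file URLs per host (the question does not say how the files are discovered, so that part is hypothetical), a Python 3 sketch of the size-threshold check could look like this:

```python
from urllib.request import urlopen

def resource_size(url):
    """Size in bytes from the Content-Length header, or None if missing."""
    with urlopen(url) as r:
        length = r.info().get("Content-Length")
    return int(length) if length is not None else None

def has_two_large_files(file_urls, min_bytes):
    """Return True once at least two of `file_urls` exceed `min_bytes`."""
    found = 0
    for url in file_urls:
        size = resource_size(url)
        if size is not None and size > min_bytes:
            found += 1
            if found >= 2:
                return True
    return False
```

Stopping as soon as the second large file is found avoids touching the remaining URLs, which matters when crawling thousands of hosts.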

Upvotes: 2
