pldimitrov

Reputation: 1727

How to crawl a web page for files of certain size

I need to crawl a list of several thousand hosts and find, on each one, at least two files rooted there that are larger than a given value, passed as an argument. Can any popular (Python-based?) tool help with this?

Upvotes: 0

Views: 659

Answers (2)

hejibo

Reputation: 11

Here is how I did it. See the code below.

import urllib2

url = 'http://www.ueseo.org'
r = urllib2.urlopen(url)
# Note: this downloads the entire response body just to measure it
print len(r.read())
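Note that `r.read()` holds the whole body in memory just to count it. A Python 3 sketch of the same idea (measuring the actual transferred size) that streams the response in chunks instead, assuming you really need the downloaded size rather than the advertised one:

```python
from urllib.request import urlopen

def downloaded_size(url, chunk=64 * 1024):
    """Count the bytes of the response body without keeping it in memory."""
    total = 0
    with urlopen(url) as r:
        while True:
            block = r.read(chunk)
            if not block:
                break
            total += len(block)
    return total

# e.g. downloaded_size('http://www.ueseo.org')
```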

Upvotes: 1

Zachary Richey

Reputation: 302

Here is an example of how you can get the file size of a file on an HTTP server.

import urllib2

def sizeofURLResource(url):
    """
    Return the size of the resource at 'url' in bytes,
    as reported by its Content-Length header.
    """
    info = urllib2.urlopen(url).info()
    # Note: the header value comes back as a string, not an int
    return info.getheaders("Content-Length")[0]
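The snippet above is Python 2 (`urllib2` no longer exists in Python 3). A hedged Python 3 equivalent using `urllib.request`, which also converts the header to an int and tolerates servers that omit Content-Length:

```python
from urllib.request import urlopen

def size_of_url_resource(url):
    """Return the size in bytes of the resource at `url`, as reported
    by its Content-Length header, or None if the header is absent."""
    with urlopen(url) as r:
        length = r.info().get("Content-Length")
    return int(length) if length is not None else None

# e.g. size_of_url_resource('http://www.ueseo.org')
```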

There is also a library for building web scrapers here: http://dev.scrapy.org/, but I don't know much about it (just googled it, honestly).
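Tying this back to the original question ("at least two files larger than some value"): assuming you already have a list of candidate file URLs per host (the question does not say how the files are discovered, so that part is hypothetical), a Python 3 sketch of the size-threshold check could look like this:

```python
from urllib.request import urlopen

def resource_size(url):
    """Size in bytes from the Content-Length header, or None if missing."""
    with urlopen(url) as r:
        length = r.info().get("Content-Length")
    return int(length) if length is not None else None

def has_two_large_files(file_urls, min_bytes):
    """Return True once at least two of `file_urls` exceed `min_bytes`."""
    found = 0
    for url in file_urls:
        size = resource_size(url)
        if size is not None and size > min_bytes:
            found += 1
            if found >= 2:
                return True
    return False
```

Stopping as soon as the second large file is found avoids touching the remaining URLs, which matters when crawling thousands of hosts.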

Upvotes: 2
