papasmurf

Reputation: 313

Speeding up process speed of file downloads from the web

I'm writing a program that has to download a bunch of files from the web before it can even run, so I created a function called init_program that downloads all the files and "initializes" the program. It works by running through a couple of dicts that map names to gist URLs on GitHub; it pulls each URL and uses urllib2 to download it. I won't be able to include all the files here, but you can try it out by cloning the repo here. Here's the function that creates the files from the gists:

def init_program():
    """ Initialize the program and allow all the files to be downloaded
        This will take awhile to process, but I'm working on the processing
        speed """

    downloaded_wordlists = []  # Used to count the amount of items downloaded
    downloaded_rainbow_tables = []

    print("\n")
    banner("Initializing program and downloading files, this may take awhile..")
    print("\n")

    # INIT_FILE is a file that will contain "false" if the program is not initialized
    # And "true" if the program is initialized
    with open(INIT_FILE) as data: 
        if data.read() == "false": 
            for item in GIST_DICT_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} wordlists.. ".format(len(downloaded_wordlists) + 1, 
                                                                                  len(GIST_DICT_LINKS.keys())))
                sys.stdout.flush()
                new_wordlist = open("dicts/included_dicts/wordlists/{}.txt".format(item), "a+") 
                # Download the wordlists and save them into a file
                wordlist_data = urllib2.urlopen(GIST_DICT_LINKS[item])
                new_wordlist.write(wordlist_data.read())
                downloaded_wordlists.append(item + ".txt")
                new_wordlist.close()

            print("\n")
            banner("Done with wordlists, moving to rainbow tables..")
            print("\n")

            for table in GIST_RAINBOW_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} rainbow tables".format(len(downloaded_rainbow_tables) + 1, 
                                                                                    len(GIST_RAINBOW_LINKS.keys())))
                new_rainbowtable = open("dicts/included_dicts/rainbow_tables/{}.rtc".format(table))
                # Download the rainbow tables and save them into a file
                rainbow_data = urllib2.urlopen(GIST_RAINBOW_LINKS[table])
                new_rainbowtable.write(rainbow_data.read())
                downloaded_rainbow_tables.append(table + ".rtc")
                new_rainbowtable.close()

            open(data, "w").write("true").close()  # Will never be initialized again
        else:
            pass

    return downloaded_wordlists, downloaded_rainbow_tables

This works, yes, but it's extremely slow due to the size of the files: each file has at least 100,000 lines in it. How can I speed up this function to make it faster and more user-friendly?

Upvotes: 1

Views: 2318

Answers (2)

Paul Rudin

Reputation: 27

You're blocking whilst you wait for each download, so the total time is the sum of the round-trip time for each download, and your code will likely spend most of it waiting on network traffic. One way to improve this is not to block whilst you wait for each response. You can do this in several ways: for example, by handing off each request to a separate thread (or process), or by using an event loop and coroutines. Read up on the threading and asyncio modules.
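
For instance, here is a minimal sketch of the threaded approach using a pool of worker threads from concurrent.futures. It assumes Python 3 (urllib.request instead of the question's urllib2); GIST_DICT_LINKS is the name-to-URL dict from the question, while fetch_wordlist, download_all and the worker count are just illustrative choices:

import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_wordlist(name, url, out_dir="dicts/included_dicts/wordlists"):
    """Download one gist and write it to <out_dir>/<name>.txt."""
    path = os.path.join(out_dir, "{}.txt".format(name))
    with urllib.request.urlopen(url) as response, open(path, "wb") as out_file:
        out_file.write(response.read())
    return name + ".txt"

def download_all(links, max_workers=8):
    """Submit every download to a thread pool; the total time is then
    roughly the slowest single download, not the sum of all of them."""
    downloaded = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_wordlist, name, url): name
                   for name, url in links.items()}
        for future in as_completed(futures):
            downloaded.append(future.result())
            print("Downloaded {} of {} wordlists".format(len(downloaded), len(links)))
    return downloaded

# downloaded_wordlists = download_all(GIST_DICT_LINKS)

Threads work well here because the work is I/O-bound, so the GIL is not a bottleneck; the same structure applies to the rainbow-table loop.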

Upvotes: 0

GustavoIP

Reputation: 933

Some weeks ago I faced a similar situation where I needed to download many huge files, but none of the simple pure-Python solutions I found were good enough in terms of download speed. So I found Axel, a light command-line download accelerator for Linux and Unix.

What is Axel?

Axel tries to accelerate the downloading process by using multiple connections for one file, similar to DownThemAll and other famous programs. It can also use multiple mirrors for one download.

Using Axel, you will get files from the Internet faster; it can speed up a download by up to approximately 60%, according to some tests.

Usage: axel [options] url1 [url2] [url...]

--max-speed=x       -s x    Specify maximum speed (bytes per second)
--num-connections=x -n x    Specify maximum number of connections
--output=f      -o f    Specify local output file
--search[=x]        -S [x]  Search for mirrors and download from x servers
--header=x      -H x    Add header string
--user-agent=x      -U x    Set user agent
--no-proxy      -N  Just don't use any proxy server
--quiet         -q  Leave stdout alone
--verbose       -v  More status information
--alternate     -a  Alternate progress indicator
--help          -h  This information
--version       -V  Version information

As Axel is written in C and there's no C extension for Python, I used the subprocess module to execute it externally, and it works perfectly for me.

You can do something like this:

import subprocess

cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o',
       "{0}".format(filename), url]
process = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

You can also track the progress of each download by parsing the output on stdout:

while True:
    line = process.stdout.readline()
    if not line:  # EOF: axel has finished
        break
    match = YOUR_GREAT_REGEX.match(line)
    if match:
        progress = match.groups()
        ...
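
Putting the two fragments together, a rough self-contained sketch could look like the one below. The axel path, the percentage regex and the helper name download_with_axel are my own assumptions (the exact progress format depends on your axel version), and it assumes Python 3:

import re
import subprocess

# Guessed pattern for the "NN%" progress axel prints; adjust to your version's output
PERCENT_RE = re.compile(r"(\d{1,3})%")

def download_with_axel(url, filename, n_connections=4):
    cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o', filename, url]
    process = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE,
                               universal_newlines=True)
    while True:
        line = process.stdout.readline()
        if not line:  # axel has exited and closed its stdout
            break
        match = PERCENT_RE.search(line)
        if match:
            print("\r{}: {}%".format(filename, match.group(1)), end="")
    print()
    return process.wait()  # 0 means the download succeeded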

Upvotes: 1
