Greg

Reputation: 2271

Threaded wget - minimizing resources

I have a script that gets the GeoIP locations of various IPs. It runs daily, and I expect to have around 50,000 IPs to look up.

I have a GeoIP system set up; I'd just like to avoid having to run wget 50,000 times per report.

What I was thinking is that there must be some way to have wget open a connection to the URL and then pass it the IPs, so that it doesn't have to re-establish the connection each time.

Any help will be much appreciated.

Upvotes: 1

Views: 1918

Answers (3)

Shade

Reputation: 10001

You could also write a small program (in Java or C or whatever) that sends the list of IPs as a single POST request, and have the server return an object with data about all of them. Shouldn't be too slow either.
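
For instance, here's a minimal sketch in Ruby (the endpoint URL and the 'ips' form field are hypothetical; adapt them to whatever your GeoIP server actually accepts):

#!/usr/bin/ruby
require 'net/http'
require 'uri'

# Hypothetical batch endpoint -- substitute your own GeoIP server's URL.
uri = URI.parse('http://geoip.example.com/batch')

# One IP per line in ips.txt.
ips = File.readlines('ips.txt').map(&:strip)

# Send all the IPs in a single POST request over one connection.
response = Net::HTTP.post_form(uri, 'ips' => ips.join("\n"))
puts response.body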

Upvotes: 0

Druid

Reputation: 6453

You could also write a threaded Ruby script that runs wget on multiple input files simultaneously to speed up the process. If you have 5 files containing 10,000 addresses each, you could use this script:

#!/usr/bin/ruby

threads = []

# Spawn one thread per input file given on the command line.
ARGV.each do |file|
  threads << Thread.new(file) do |filename|
    # wget -i reads a list of addresses from the file and fetches them
    # all, reusing the connection where the server supports keep-alive.
    system("wget -i #{filename}")
  end
end

# Wait for every download thread to finish.
threads.each(&:join)

Each of these threads would use one connection to download all the addresses in its file. The following command then means only 5 connections to the server to look up all 50,000 addresses.

./fetch.rb "list1.txt" "list2.txt" "list3.txt" "list4.txt" "list5.txt"

Upvotes: 0

ephemient

Reputation: 204698

If you give wget several addresses at once, and consecutive addresses belong to the same server that supports HTTP/1.1 (Connection: keep-alive), wget will re-use the already-established connection.

If there are too many addresses to list on the command line, you can write them to a file and use the -i/--input-file= option (and, per UNIX tradition, -i-/--input-file=- reads standard input).
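
For example, assuming a hypothetical lookup endpoint that takes the IP as a query parameter, you could generate the URLs on the fly and feed them to a single wget process via standard input:

# geoip.example.com/lookup?ip= is a hypothetical endpoint; substitute your own.
ruby -ne 'puts "http://geoip.example.com/lookup?ip=#{$_.chomp}"' ips.txt | wget -i - -O results.txt

Since every URL points at the same server, wget can keep one connection open for the whole run, and -O concatenates all the responses into results.txt.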

There is, however, no way to preserve a connection across different wget invocations.

Upvotes: 2
