Reputation: 811
I'm cloning a large sample of GitHub projects for an empirical study. I'm assuming it will be faster to download the 80,000 projects with some concurrency, but that's a lot to download.
How can I start ~1,000 processes and then start another after each one finishes? Or, is there some other way I should go about this? Will downloading this much at a faster-than-sequential rate be bad for GitHub's servers?
Here's the relevant code so far:
// Create a CountDownLatch that will only reach 0 when all repositories
// have been downloaded
CountDownLatch doneSignal = new CountDownLatch(numberOfRepositories);

// Start the download for each git repository
for (String URL : gitURLs)
{
    new Thread(new Worker(doneSignal, URL)).start();
}

doneSignal.await();
Worker:
public class Worker implements Runnable
{
    private final CountDownLatch doneSignal;
    private final String URL;

    Worker(CountDownLatch doneSignal, String URL)
    {
        this.doneSignal = doneSignal;
        this.URL = URL;
    }

    @Override
    public void run()
    {
        try
        {
            // Run the command line process to download
            ProcessBuilder pb =
                new ProcessBuilder("git", "clone", "--depth=1", URL, "projects/" + getProjectName(URL));
            Process p = pb.start();
            p.waitFor();
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        doneSignal.countDown();
    }
}
Upvotes: 1
Views: 103
Reputation: 30445
There's no need for multi-threading and custom Java code for a simple task like this, especially since each thread just spawns an external process via the CLI. That's over-engineering, and you could get the job done more quickly with something simpler.
It looks like you probably already have a file with the URLs of all the projects you want to clone. I would use a few commands in my text editor (Sublime Text) to add git clone --depth=1 to the beginning of each line and & to the end (the trailing & runs a command in the background). If your text editor can't do that easily, a little bash/awk/Perl/Ruby/Python/etc. script could do it in no more than a few lines.
Then your list of URLs becomes... a valid shell script, which will clone all the repos in parallel! And you can run it as such.
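For example, a sed one-liner can do the rewrite (a sketch, assuming one URL per line; urls.txt and the sample URLs below are made-up names):

```shell
# Sample input: one repository URL per line (the real file would have 80,000)
printf '%s\n' \
  'https://github.com/torvalds/linux.git' \
  'https://github.com/git/git.git' > urls.txt

# Prepend the clone command and append ' &' so each clone runs in the background
# (\& escapes sed's special meaning of & in the replacement text)
sed 's|^|git clone --depth=1 |; s|$| \&|' urls.txt > clone_all.sh

cat clone_all.sh
# Run it with: sh clone_all.sh
```

If you'd rather cap the number of simultaneous clones instead of launching them all at once, `xargs -P 20 -n 1 git clone --depth=1 < urls.txt` runs at most 20 clones at a time.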
Note, though, that while parallel downloads will help you, 1,000 at once is way too many. You can experiment with the number, but you will probably find that running more than 20 at the same time does not help.
Upvotes: 0
Reputation: 41
You can try Java 8's parallelStream to multithread your downloads:
List<String> gitURLs = new ArrayList<>(); // populate with the repository URLs

gitURLs.parallelStream().forEach(URL ->
{
    try
    {
        // Run the command line process to download
        ProcessBuilder pb =
            new ProcessBuilder("git", "clone", "--depth=1", URL, "projects/" + getProjectName(URL));
        Process p = pb.start();
        p.waitFor();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
});
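Note that parallelStream runs on the common ForkJoinPool, whose parallelism defaults to the number of CPU cores, so you cannot directly choose how many clones run at once. A commonly relied-upon (though not formally specified) trick is to start the stream from inside a custom ForkJoinPool. A minimal sketch with a stubbed-out download (the sleep stands in for the git subprocess, and the class name and URLs are made up):

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BoundedParallelStream {
    static final AtomicInteger active = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    // Stand-in for the clone; the real task would launch the git process here
    static void download(String url) {
        int now = active.incrementAndGet();
        peak.accumulateAndGet(now, Math::max); // record highest concurrency seen
        try {
            Thread.sleep(20); // simulate I/O-bound work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            active.decrementAndGet();
        }
    }

    // Returns the peak number of downloads that ran simultaneously
    public static int run(List<String> urls, int parallelism) throws Exception {
        active.set(0);
        peak.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // A terminal operation started inside a custom pool executes in
            // that pool, so at most `parallelism` downloads run at once
            pool.submit(() -> urls.parallelStream().forEach(BoundedParallelStream::download)).get();
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = IntStream.range(0, 40)
            .mapToObj(i -> "https://github.com/example/repo" + i + ".git")
            .collect(Collectors.toList());
        System.out.println("peak concurrency: " + run(urls, 5));
    }
}
```

The stream's tasks then execute in the submitting pool rather than the common pool, which bounds the number of simultaneous clones.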
Upvotes: 0
Reputation: 63955
It's bad for GitHub's servers, but it's even worse for your performance. Try maybe 5 or so instead of 1,000. To limit the code to X parallel threads, you can use a thread pool:
CountDownLatch doneSignal = new CountDownLatch(numberOfRepositories);

// Start the download for each git repository
ExecutorService pool = Executors.newFixedThreadPool(5);
for (String URL : gitURLs) {
    pool.execute(new Worker(doneSignal, URL));
}
pool.shutdown();
doneSignal.await();
This also works without the latch, because after shutdown() you can wait for the pool to finish all queued tasks via e.g.
pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
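For completeness, here is a minimal self-contained sketch of the latch-free variant. The clone step is stubbed out with a counter; in a real worker it would be the ProcessBuilder call from the question, and the class name and URLs are made up:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolDownloader {
    static final AtomicInteger completed = new AtomicInteger();

    // Stand-in for the real work; the actual worker would run
    // "git clone --depth=1 <URL> projects/<name>" via ProcessBuilder
    static void cloneRepo(String url) {
        completed.incrementAndGet();
    }

    // Returns how many downloads completed
    public static int downloadAll(List<String> urls, int parallelism) throws InterruptedException {
        completed.set(0); // reset so the method can be called repeatedly
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        for (String url : urls) {
            pool.execute(() -> cloneRepo(url));
        }
        pool.shutdown(); // stop accepting new tasks; queued ones still run
        // Wait (effectively forever) for all queued tasks to finish
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = Arrays.asList(
            "https://github.com/example/a.git",
            "https://github.com/example/b.git",
            "https://github.com/example/c.git");
        System.out.println("completed: " + downloadAll(urls, 2));
    }
}
```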
Upvotes: 3