Reputation: 25
I have been reading about different NLP models like word2vec and GloVe, and how these can be parallelized because they are mostly just dot products. However, I am a bit confused by this, because computing the gradient and updating the model depend on the current values of the parameters/vectors. How is this done in parallel/asynchronously? How do you know when to update the global parameters using the gradients being computed stochastically by each of the threads?
Upvotes: 1
Views: 116
Reputation: 54173
Generally, doing everything approximately and asynchronously, with some lag/drift between workers, doesn't hurt convergence that much in practice: each thread computes gradients against slightly stale parameters and applies its updates without waiting for the others. Two of the key early papers were:
"HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent"
by Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu
https://papers.nips.cc/paper/2011/hash/218a0aefd1d1a4be65601cc6ddc1520e-Abstract.html
ABSTRACT: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
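Here is a minimal sketch of the HOGWILD! idea in plain Python/NumPy (my own illustration, not the paper's code; the synthetic sparse logistic-regression data, learning rate, and thread count are all assumptions). Several threads update one shared parameter vector in place with no locking, which is tolerable because each sparse example touches only a few coordinates, so collisions are rare. Note that CPython's GIL limits how much true parallelism you get here; real word2vec implementations run this kind of loop in compiled code with the GIL released.

```python
# HOGWILD!-style sketch: several threads update one shared parameter vector
# in place, with no locks at all. Sparse examples mean collisions are rare.
import threading
import numpy as np

rng = np.random.default_rng(0)
n_features, n_examples, n_threads = 1_000, 20_000, 4

# Synthetic sparse logistic-regression data: ~10 active features per example.
true_w = rng.normal(size=n_features)
rows = [rng.choice(n_features, size=10, replace=False) for _ in range(n_examples)]
vals = [rng.normal(size=10) for _ in range(n_examples)]
labels = [1.0 if v @ true_w[r] > 0 else 0.0 for r, v in zip(rows, vals)]

w = np.zeros(n_features)          # the single shared parameter vector
lr = 0.1

def worker(example_ids):
    for i in example_ids:
        r, x, y = rows[i], vals[i], labels[i]
        # Read possibly stale coordinates, compute the gradient, write back
        # in place -- no lock, other threads may interleave with us here.
        p = 1.0 / (1.0 + np.exp(-(x @ w[r])))
        w[r] -= lr * (p - y) * x  # sparse, unsynchronized update

shards = np.array_split(rng.permutation(n_examples), n_threads)
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()

acc = np.mean([(v @ w[r] > 0) == (y > 0.5) for r, v, y in zip(rows, vals, labels)])
print(f"training accuracy after one lock-free pass: {acc:.3f}")
```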
"Large Scale Distributed Deep Networks"
by Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, Andrew Ng
https://papers.nips.cc/paper/2012/hash/6aca97005c68f1206823815f66102863-Abstract.html
ABSTRACT: Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
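And a rough parameter-server sketch in the spirit of Downpour SGD (again my own simplification, not DistBelief code; the least-squares objective, pull/push frequency, and learning rate are assumptions): each worker trains against a locally cached, slightly stale copy of the parameters, periodically pushing accumulated gradients to a central server and pulling a fresh snapshot. That is the answer to the "when to update the global parameters" part of your question: whenever a worker reports in, with no global synchronization barrier.

```python
# Parameter-server sketch in the spirit of Downpour SGD: workers keep a local
# replica, pull fresh parameters every `pull_every` steps, and push accumulated
# gradients back asynchronously, so replicas are always slightly stale.
import threading
import numpy as np

rng = np.random.default_rng(1)
dim, n_workers, steps_per_worker, pull_every = 50, 4, 500, 5
true_w = rng.normal(size=dim)

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()  # serializes server-side applies only

    def pull(self):
        with self.lock:
            return self.w.copy()      # workers get a snapshot (may go stale)

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad  # apply whenever a worker reports in

server = ParameterServer(dim)

def worker(seed):
    local_rng = np.random.default_rng(seed)
    w_local = server.pull()
    grad_acc = np.zeros(dim)
    for step in range(1, steps_per_worker + 1):
        # One synthetic least-squares example; gradient uses the *stale* replica.
        x = local_rng.normal(size=dim)
        err = x @ w_local - x @ true_w
        grad_acc += err * x
        if step % pull_every == 0:    # communicate only occasionally
            server.push(grad_acc / pull_every)
            grad_acc[:] = 0.0
            w_local = server.pull()

threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_workers)]
for t in threads: t.start()
for t in threads: t.join()

print("distance to true weights:", np.linalg.norm(server.w - true_w))
```

The lock here only serializes the server-side apply so the sketch stays well defined; in the paper's setting the parameter server is itself sharded across machines and updates arrive asynchronously over the network.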
Upvotes: 2