Deep neural networks are at the forefront of many advances in machine learning, but those advances come at a steep computational cost. With sometimes hundreds of millions, or even billions, of parameters to continually adjust during training, these networks rely on specialized hardware to make the computation feasible.
The primary operation required to train a neural network is matrix multiplication. Each individual operation is not especially expensive, but with massive numbers of parameters to compute, proportionately massive parallelization of those operations is necessary. Graphics Processing Units (GPUs) are well suited to the task, with many able to perform thousands of computations in parallel. Compare that with a Central Processing Unit (CPU), where even a higher-end chip may have only a few tens of cores.
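To make the point concrete, here is a minimal sketch of why training is dominated by matrix multiplication: one forward step of a fully connected layer is a single matrix product in which every output is an independent dot product. The layer sizes below are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical sizes for illustration only: a batch of 64 inputs through
# one fully connected layer with 1,024 inputs and 4,096 outputs.
batch, n_in, n_out = 64, 1024, 4096

rng = np.random.default_rng(0)
x = rng.standard_normal((batch, n_in)).astype(np.float32)  # activations
W = rng.standard_normal((n_in, n_out)).astype(np.float32)  # layer weights

# One forward step of the layer is a single matrix multiplication:
# each of the batch * n_out outputs is an n_in-term dot product, and
# all of them are independent -- ideal for massive parallelism.
y = x @ W
print(y.shape)  # (64, 4096)
```

Each of those 64 × 4,096 dot products can run in parallel, which is exactly the workload a GPU's thousands of cores are built for.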
SLIDE feed-forward and backpropagation (📷: S. Daghaghi et al.)
The problem is that GPUs are highly specialized in the types of computations they are useful for and can come with high price tags, so they are not nearly as prevalent as CPUs. This limits who can reasonably train very large deep learning models to well-funded organizations, leaving small businesses and hobbyists out of all the fun.
Computer scientists at Rice University are working to democratize machine learning with algorithmic advancements that can allow CPUs to outperform GPUs in deep learning model training. The team built upon the open source Sub-LInear Deep learning Engine (SLIDE), which recasts model training as a problem that can be solved with sparse, hash table-based backpropagation rather than matrix multiplication. In their research, they uncovered ways in which the current implementation of SLIDE is less than optimal, and they have proposed modifications that exploit technological advances available in modern CPUs.
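The hash table idea can be sketched as follows. This is not SLIDE's actual implementation, just a simplified stand-in using signed-random-projection hashing: neurons whose weight vectors point in a similar direction to the input tend to collide in the same bucket, so only that small colliding set is computed instead of the full layer. All sizes and names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_neurons, n_bits = 128, 10_000, 8  # illustrative sizes

# Hypothetical layer weights: one row per neuron.
W = rng.standard_normal((n_neurons, n_in)).astype(np.float32)

# Signed random projections: the sign pattern against these planes
# gives each vector an n_bits bucket id. Similar directions tend to
# share a bucket. (A simplified stand-in for SLIDE's LSH tables.)
planes = rng.standard_normal((n_bits, n_in)).astype(np.float32)

def bucket(v):
    bits = (planes @ v) > 0
    return int(np.packbits(bits)[0])  # n_bits <= 8 fits in one byte

# Build the hash table once: bucket id -> list of neuron ids.
table = {}
for i in range(n_neurons):
    table.setdefault(bucket(W[i]), []).append(i)

# At training time, hash the input and touch only the colliding
# neurons, instead of multiplying against all 10,000 rows of W.
x = rng.standard_normal(n_in).astype(np.float32)
active = table.get(bucket(x), [])
activations = W[active] @ x  # sparse forward pass over the active set
print(len(active), "of", n_neurons, "neurons computed")
```

Because only the hashed-in neurons are updated, each training step does work proportional to the active set rather than the full layer, which is where the "sub-linear" in SLIDE's name comes from.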
Optimized vs. native SLIDE (📷: S. Daghaghi et al.)
To take advantage of CPU memory caching, the team modified the SLIDE algorithm to store all data for a given batch in a single contiguous vector, rather than in a number of separate vectors, which avoids many unnecessary cache misses. Another clever trick the team employed makes use of the ever-widening CPU registers. In particular, they exploit 512-bit registers by loading multiple 32-bit data points into a single register, then performing a single operation, such as an add, on all of those data points simultaneously.
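A rough sketch of both ideas, using numpy as a stand-in for the C++ data structures and AVX-512 intrinsics the team actually works with (the buffer shapes here are assumptions for illustration):

```python
import numpy as np

# Illustrative only (not SLIDE's actual data structures): pack a
# batch's worth of per-sample vectors into one contiguous float32
# buffer instead of many separately allocated vectors.
batch, dim = 256, 1024
scattered = [np.zeros(dim, dtype=np.float32) for _ in range(batch)]  # many allocations
packed = np.zeros((batch, dim), dtype=np.float32)                    # one contiguous block

# Sequential memory means the hardware prefetcher can stream the data
# in, avoiding the cache misses that chasing scattered pointers causes.
assert packed.flags["C_CONTIGUOUS"]

# A 512-bit register holds 16 float32 lanes, so a vectorized add can
# advance 16 elements per instruction where AVX-512 is available.
lanes = 512 // 32  # 16
a = np.arange(lanes, dtype=np.float32)
b = np.full(lanes, 2.0, dtype=np.float32)
print(a + b)  # one conceptual "vector add" across all 16 lanes
```

In the real implementation this lane-wise add would be a single SIMD instruction over a 512-bit register; the point is that one instruction replaces sixteen scalar ones.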
These enhancements to SLIDE have resulted in deep learning models, with hundreds of millions of parameters, that train a very impressive fifteen times faster on CPUs than they do on GPUs. Reducing the cost of entry for large-scale machine learning training has the potential to increase innovation on many fronts; it will be interesting to watch this, and other algorithmic enhancements, in the years to come.