2.1 GPUs, TPUs, and batches
Graphics Processing Units (GPUs) were originally designed for real-time image synthesis, which requires highly parallel architectures that happen to be well suited to deep models. As their usage for AI has increased, GPUs have been equipped with dedicated sub-components referred to as tensor cores, and deep-learning specialized chips such as Google’s Tensor Processing Units (TPUs) have been developed.
A GPU possesses several thousand parallel computing units and its own fast memory. The limiting factor is usually not the number of computing units, but the read-write operations to memory. The slowest link is between the CPU memory and the GPU memory, and consequently one should avoid copying data across devices. Moreover, the structure of the GPU itself involves multiple levels of cache memory, which are smaller but faster, and computation should be organized so as to avoid copies between these different caches.
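As a minimal sketch of this principle, the following PyTorch snippet (the framework, model, and tensor sizes are illustrative assumptions, not part of the text) moves the model parameters to the GPU once and performs a single explicit host-to-device copy per batch, keeping all subsequent computation on the device.

```python
import torch

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The model parameters are copied to the GPU memory once.
model = torch.nn.Linear(256, 10).to(device)

x = torch.randn(512, 256)   # data initially resides in CPU memory
x = x.to(device)            # one explicit, and slow, host-to-device copy
y = model(x)                # the computation itself stays on the GPU

# Bringing results back to the CPU is another slow device-to-host copy,
# so it should be done only when actually needed, e.g. for logging.
print(y.mean().item())
```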
This is achieved, in particular, by organizing the computation in batches of samples that can fit entirely in the GPU memory and are processed in parallel. When an operator combines a sample and model parameters, both have to be moved to the cache memory near the actual computing units. Proceeding by batches allows for copying the model parameters only once, instead of doing it for every sample. In practice, a GPU processes a batch that fits in memory almost as quickly as a single sample.
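The timing sketch below illustrates this effect, under the assumption of a PyTorch setup with an arbitrary linear model and batch size; on a GPU, the single batched call takes a time comparable to processing only a few samples individually, since the parameters are moved to the compute units once and the samples are processed in parallel.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(1024, 1024).to(device)
batch = torch.randn(256, 1024, device=device)

def timed(f):
    # Synchronize before and after so that asynchronous GPU kernels are
    # actually finished when we read the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    f()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

model(batch)  # warm-up call, to exclude one-time initialization costs

# One forward pass per sample versus a single forward pass over the batch.
t_loop = timed(lambda: [model(x.unsqueeze(0)) for x in batch])
t_batch = timed(lambda: model(batch))

print(f"per-sample loop: {t_loop:.4f}s, single batched call: {t_batch:.4f}s")
```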
A standard GPU has a theoretical peak performance of $10^{13}-10^{14}$ floating point operations (FLOPs) per second, and its memory typically ranges from 8 to 80 gigabytes. The standard FP32 encoding of float numbers is on 32 bits, but empirical results show that using a 16-bit encoding, or even fewer bits for some operands, does not degrade the quality of the results.
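A hedged sketch of such reduced-precision computation, again assuming PyTorch and a placeholder model: the autocast context runs selected operations in a 16-bit format while the parameters remain stored in FP32.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(64, 1024, device=device)

# Inside the autocast region, matrix products are computed in bfloat16,
# which roughly halves the memory traffic; on recent GPUs they run on
# tensor cores. The parameters themselves stay in FP32.
with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16 for the output of the autocast region
```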
We come back in § 3.6 to the large size of deep architectures.