Thursday, August 3, 2017

TPU Research Notes

Here is a collection of notes from my research on TPU architecture.

In-Datacenter Performance Analysis of a Tensor Processing Unit (ISCA 2017)


The first TPU was developed to accelerate the inference phase of neural networks (not training).  The heart of the TPU is a matrix multiply unit with 65,536 8-bit MACs and a large 28 MB software-managed on-chip memory.  An activation unit implements the nonlinear activation functions in hardware (ReLU, sigmoid, and so on).  One benefit of the TPU architecture is its deterministic execution model, in contrast to the time-varying behavior of CPUs and GPUs.  The paper reports that the TPU is on average about 15X-30X faster than a contemporary GPU or CPU, with about 30X-80X better TOPS/Watt.

Introduction to NNs

Large data sets in the cloud and the availability of compute power enabled a renaissance in machine learning.  The use of deep neural networks led to breakthroughs in reducing the error rates of speech recognition, image recognition, etc.

NNs imitate brain-like functionality and are based on a simple artificial neuron.  This "neuron" is a nonlinear function of a weighted sum of its inputs.  The "neurons" are connected in layers.  The "deep" part of DNN comes from going beyond a few layers, which was made possible by higher compute capability.
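To make the "nonlinear function of a weighted sum" concrete, here is a minimal sketch in plain Python.  The names (relu, neuron, layer) and the tiny weights are my own illustrative choices, not anything taken from the paper.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp negatives to 0."""
    return max(0.0, x)


def neuron(inputs, weights, bias=0.0, activation=relu):
    """One artificial neuron: a nonlinear function of a weighted sum."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


def layer(inputs, weight_rows, biases):
    """A layer is a set of neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two layers chained together; a "deep" network just stacks more of these.
hidden = layer([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.0])
output = layer(hidden, [[2.0, 1.0]], [0.5])
print(hidden, output)  # [0.0, 3.0] [3.5]
```

Each layer is really a matrix-vector multiply followed by an elementwise nonlinearity, which is why a big matrix unit plus an activation unit covers most of the work.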

The two phases of NNs are called training (or learning) and inference (or prediction).  Virtually all training today is done in floating point, which is one reason GPUs are popular for it.  A step called quantization transforms floating-point numbers into narrow integers (sometimes just 8 bits), which are usually good enough for inference.
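As a rough illustration of quantization, the sketch below maps floats to signed 8-bit integers with a single scale factor.  This symmetric linear scheme is an assumption on my part; the notes above do not say which quantization scheme is actually used, and the function names are hypothetical.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization (an assumed scheme, for illustration).

    Returns the integer codes and the scale needed to map them back.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Approximate the original floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
print(codes)  # [50, -127, 0, 100]
```

The round trip is not exact, but each restored value is within half a quantization step of the original, which is the sense in which narrow integers are "good enough".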


The TPU is designed to be a coprocessor on the PCIe bus.  The host sends TPU instructions for the coprocessor to execute.

TPU instructions are sent from the host over PCIe into an instruction buffer.  The main instructions include:

  • Read_Host_Memory:  Read data from CPU host memory into Unified Buffer.
  • Read_Weights:  Read weights from weight memory into the Weight FIFO as input to the Matrix Unit.
  • MatrixMultiply/Convolve:  Perform a matrix multiply or convolution from the Unified Buffer into the Accumulators.
  • Activate:  Perform the nonlinear function of the artificial neuron.  Its inputs are the Accumulators and its output is the Unified Buffer.
  • Write_Host_Memory:  Write data from the Unified Buffer into CPU host memory.
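The instruction list above reads like a small pipeline, so here is a toy software model of it.  Everything in this sketch is hypothetical: the ToyTPU class, its method names, the ReLU choice in activate, and the dictionary used as weight memory are stand-ins for illustration, not the real instruction encoding or hardware behavior.

```python
from collections import deque


class ToyTPU:
    """Toy model of the data movement implied by the five instructions."""

    def __init__(self):
        self.unified_buffer = []     # activations live here
        self.weight_memory = {}      # hypothetical: named weight matrices
        self.weight_fifo = deque()   # weights staged for the matrix unit
        self.accumulators = []       # raw matrix unit results

    def read_host_memory(self, host_data):
        self.unified_buffer = list(host_data)

    def read_weights(self, name):
        self.weight_fifo.append(self.weight_memory[name])

    def matrix_multiply(self):
        w = self.weight_fifo.popleft()          # w[i][j]: input i -> output j
        x = self.unified_buffer
        self.accumulators = [
            sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))
        ]

    def activate(self):
        # ReLU chosen for the sketch; result goes back to the Unified Buffer.
        self.unified_buffer = [max(0, a) for a in self.accumulators]

    def write_host_memory(self):
        return list(self.unified_buffer)


tpu = ToyTPU()
tpu.weight_memory["layer0"] = [[1, -2], [2, 1]]
tpu.read_host_memory([3, 4])
tpu.read_weights("layer0")
tpu.matrix_multiply()
tpu.activate()
result = tpu.write_host_memory()
print(result)  # [11, 0]
```

Note how the Unified Buffer is both the source for the multiply and the destination of the activation, so the output of one layer is already in place as the input to the next.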
Data flows in from the left and weights are loaded from the top.  Execution of matrix multiplies occurs in a systolic manner (no fetch, execute, writeback sequence).
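To see what "systolic" means in practice, the following is a small cycle-by-cycle simulation.  It assumes a weight-stationary arrangement where each cell holds one weight, activations move one cell to the right per cycle, and partial sums move one cell down per cycle.  The function name and the skewed input timing are my own sketch of the idea, not a description of the actual hardware.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated K x N grid of multiply-accumulate cells.

    X is B x K (one input vector per row), W is K x N.  Cell (i, j) holds
    W[i][j] permanently.  Row i of the grid receives X[b][i] at cycle b + i,
    so the work sweeps through the grid as a diagonal wavefront.
    """
    B, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]    # activation leaving each cell
    psum = [[0] * N for _ in range(K)]   # partial sum leaving each cell
    Y = [[0] * N for _ in range(B)]

    for t in range(B + K + N - 2):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for i in range(K):
            for j in range(N):
                if j == 0:
                    b = t - i                          # skewed input from the left
                    a_in = X[b][i] if 0 <= b < B else 0
                else:
                    a_in = act[i][j - 1]               # from the left neighbor
                p_in = psum[i - 1][j] if i > 0 else 0  # from the cell above
                new_act[i][j] = a_in
                new_psum[i][j] = p_in + a_in * W[i][j]
        act, psum = new_act, new_psum
        # Finished sums drop out of the bottom row, one column after another.
        for j in range(N):
            b = t - (K - 1) - j
            if 0 <= b < B:
                Y[b][j] = psum[K - 1][j]
    return Y


Y = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(Y)  # [[19, 22], [43, 50]]
```

No cell ever fetches an operand from memory or writes a result back; each one only talks to its neighbors, which is the point of the systolic design.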

(Interesting fact: 15% of the papers at ISCA 2016 were on hardware accelerators for NNs.  All nine looked at CNNs, and only two mentioned other NNs.)


Results TL;DR: they are good.

Increasing memory bandwidth has the biggest performance impact.  Performance goes up by 3X on average when memory bandwidth increases 4X.  Using GDDR5 instead of DDR3 would have improved memory bandwidth by more than a factor of 5.

Clock rate has little benefit, with or without more accumulators.  A more aggressive design might have increased the clock rate by 50%, but it would not have helped much.  Why?  MLPs (multi-layer perceptrons) and LSTMs are memory bound; only CNNs are compute bound.
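A simple roofline-style model shows why bandwidth matters more than clock rate for memory-bound workloads.  The 65,536 MAC count comes from the notes above; the clock rate, memory bandwidth, and the two operational intensities below are assumed values picked for illustration, so treat the absolute numbers as placeholders.

```python
def attainable_ops(peak_ops_per_sec, mem_bytes_per_sec, ops_per_byte):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_ops_per_sec, mem_bytes_per_sec * ops_per_byte)


MACS = 65_536            # from the notes above
CLOCK_HZ = 700e6         # assumed clock rate (illustrative)
BANDWIDTH = 34e9         # assumed weight-memory bandwidth in bytes/s (illustrative)


def peak(clock_hz):
    """Each MAC counts as two operations (multiply and add) per cycle."""
    return 2 * MACS * clock_hz


# Hypothetical operational intensities (operations per byte of weights read).
workloads = {"memory-bound (MLP/LSTM-like)": 100, "compute-bound (CNN-like)": 5000}

speedups = {}
for name, intensity in workloads.items():
    base = attainable_ops(peak(CLOCK_HZ), BANDWIDTH, intensity)
    more_memory = attainable_ops(peak(CLOCK_HZ), 4 * BANDWIDTH, intensity)
    faster_clock = attainable_ops(peak(1.5 * CLOCK_HZ), BANDWIDTH, intensity)
    speedups[name] = (more_memory / base, faster_clock / base)
    print(f"{name}: 4X memory -> {more_memory / base:.2f}X, "
          f"1.5X clock -> {faster_clock / base:.2f}X")
```

In this idealized model the memory-bound case scales fully with bandwidth and not at all with clock rate, while the compute-bound case does the opposite.  An average over a mix dominated by memory-bound workloads would therefore land below the full 4X, which is at least consistent in direction with the 3X figure above.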