I previously summarized the original TPU design here.
The company is still not willing to give details about the processing elements and interconnect. There is no floating-point element (the design uses "flexpoint" instead). The interconnect extends to high-speed serial links. From a programming perspective, talking between chips looks the same as talking to different units on a single chip. The architecture is completely non-coherent and there are no caches, only software-managed resources. Message passing is used to communicate between chips. This is unbelievably similar to the processor I helped architect and implement for REX which, in itself, has roots in Transputer concepts.
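To make the "same programming model on-chip and off-chip" point concrete, here is a toy sketch of message passing with software-managed mailboxes. Everything in it (the `Fabric` class, the `(chip, unit)` addressing, the method names) is hypothetical and only illustrates the idea; the company has not published its actual interface.

```python
from collections import deque


class Fabric:
    """Hypothetical model: every endpoint is addressed as (chip, unit).

    There is no shared, coherent memory here. Data only moves when
    software explicitly sends it to a destination mailbox.
    """

    def __init__(self):
        self.mailboxes = {}  # (chip, unit) -> queue of pending messages

    def send(self, dst, payload):
        # The sender uses the same call whether dst is on this chip or another.
        self.mailboxes.setdefault(dst, deque()).append(payload)

    def recv(self, dst):
        # Messages are delivered in the order they were sent.
        return self.mailboxes[dst].popleft()


fabric = Fabric()
tile = [[1, 2], [3, 4]]      # a piece of a matrix

fabric.send((0, 1), tile)    # another unit on chip 0
fabric.send((1, 0), tile)    # a unit on a different chip, same call

on_chip = fabric.recv((0, 1))
off_chip = fabric.recv((1, 0))
```

In real hardware the off-chip path would go over the serial links and cost more latency, but the software-visible operation would be the same.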
"Here is a piece of the matrix and I need to get it to the other chip". Does this mean that the systolic nature of matrix mult. workloads are controlled via microcode?
Neurostream is a processor-in-memory solution. "Smart Memory Cubes" combine HMC and compute on a single die (NeuroCluster). Compute occupies 8% of the die area of a standard HMC. NeuroCluster is a flexible many-core platform based on efficient RISC-V PEs (4 pipeline stages, in-order) and NeuroStream co-processors (NST). There are no caches (except I$); scratchpad memory and DMA engines are used instead. It has a TLB and simple virtual memory. The NST is a SIMD accelerator that works directly off the scratchpad memory. Each NST achieves 1 MAC/cycle. 128 instances of NST at 1 GHz sum to 256 GFLOPS.
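The 256 GFLOPS figure follows from the numbers above if each MAC is counted as two floating-point operations (one multiply, one add). That counting convention is an assumption here, not something stated in the source. A quick check:

```python
# Figures quoted in the text above.
NUM_NST = 128          # NST instances
CLOCK_HZ = 1e9         # 1 GHz
MACS_PER_CYCLE = 1     # per NST

# Assumption: one MAC = one multiply + one add = 2 FLOPs.
FLOPS_PER_MAC = 2


def peak_gflops(units, clock_hz, macs_per_cycle, flops_per_mac=FLOPS_PER_MAC):
    """Peak throughput in GFLOPS for `units` identical MAC engines."""
    return units * clock_hz * macs_per_cycle * flops_per_mac / 1e9


total = peak_gflops(NUM_NST, CLOCK_HZ, MACS_PER_CYCLE)
print(total)
```

This is a peak number; sustained throughput depends on keeping the scratchpads fed by the DMA engines.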
Info on the software stack here. They have a gem5-based system simulator built on Cortex-A15.
At the core of the IPU is a graph processing base. Underlying machine learning workloads is a computational graph in which each vertex represents some compute function on a set of edges, and the edges represent data with associated weights.
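As a rough illustration of what "vertices are compute, edges are data" means, here is a minimal sketch of a computational graph for `y = x * w + b`. The dictionary layout and the `evaluate` helper are made up for this example and say nothing about how the IPU software actually represents graphs.

```python
# vertex name -> (compute function, names of the vertices feeding it)
# Each (producer, consumer) pair is an edge carrying one value.
graph = {
    "x":   (lambda: 2.0, []),                    # input
    "w":   (lambda: 3.0, []),                    # weight
    "b":   (lambda: 1.0, []),                    # bias
    "mul": (lambda x, w: x * w, ["x", "w"]),
    "add": (lambda m, b: m + b, ["mul", "b"]),
}


def evaluate(graph, target, cache=None):
    """Evaluate `target` by first evaluating the vertices it depends on."""
    cache = {} if cache is None else cache
    if target not in cache:
        fn, deps = graph[target]
        inputs = [evaluate(graph, d, cache) for d in deps]
        cache[target] = fn(*inputs)
    return cache[target]


y = evaluate(graph, "add")   # 2.0 * 3.0 + 1.0
```

Vertices with no path between them (here `x`, `w` and `b`) have no ordering constraint, which is where the parallelism comes from.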
The entire neural network model fits in the processor (not in the memory). "We are trying to map a graph to what is in effect a graph processor". The key is in software that allows us to take these complex structures and map them to a highly parallel processor with all the memory we need to hold the model inside the processor. Neural networks are expanded into a graph. The software maps the graph to a relatively simple processor with an interconnect that is controlled entirely by the compiler.
Dataflow architecture (DPU). 16k independent processors (8-bit, RISC-oriented). At the core of the approach is the use of fixed-point arithmetic with stochastic rounding techniques and many small compute elements for high parallelism.
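For readers unfamiliar with stochastic rounding: instead of always rounding to the nearest representable fixed-point value, you round up with probability equal to the fractional remainder, so the result is unbiased on average. Below is a generic sketch of the technique; the function name and the `frac_bits` parameter are my own, and this is not a description of the DPU's actual hardware.

```python
import math
import random


def stochastic_round(x, frac_bits=4, rng=random):
    """Round x to a fixed-point grid with `frac_bits` fractional bits.

    Rounds up with probability equal to the fractional remainder, so
    the expected value of the result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    remainder = scaled - lower            # in [0, 1)
    step = 1 if rng.random() < remainder else 0
    return (lower + step) / scale


rng = random.Random(0)                    # seeded so runs are repeatable
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), mean)
```

With 2 fractional bits, 0.3 can only become 0.25 or 0.5, yet the average over many samples stays close to 0.3. Round-to-nearest would return 0.25 every time, and that kind of systematic error is what hurts when many small low-precision updates are accumulated.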