**Nervana Systems**

https://www.nextplatform.com/2016/08/08/deep-learning-chip-upstart-set-take-gpus-task/

Company is still not willing to give details about the processing elements and interconnect. There are no floating-point units; a format called "Flexpoint" is used instead. The interconnect extends off-chip over high-speed serial links: from a programming perspective, talking to another chip looks the same as talking to a different unit on the same chip. The architecture is completely non-coherent and there are no caches, only software-managed memories. Message passing is used to communicate between chips. This is *unbelievably* similar to the processor I helped architect and implement for REX, which itself has roots in Transputer concepts.
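
A minimal sketch of that uniform messaging model, with all names and counts invented (Nervana has not published PE counts or the real API): a destination is just a global PE id, and only the transport underneath differs.

```python
# Hypothetical sketch: sending to a PE on another chip uses the same call
# as sending to a PE on the same chip; the runtime picks the transport.

PES_PER_CHIP = 64  # assumed; the real count is not public

def route(src_pe: int, dst_pe: int) -> str:
    """Same API either way; only the transport underneath differs."""
    if src_pe // PES_PER_CHIP == dst_pe // PES_PER_CHIP:
        return "on-chip interconnect"
    return "high-speed serial link"

def send(src_pe: int, dst_pe: int, payload: bytes) -> None:
    transport = route(src_pe, dst_pe)
    # No caches, no coherence: the payload lands in the receiver's
    # software-managed local memory at an address the program chose.
    print(f"PE {src_pe} -> PE {dst_pe} via {transport}: {len(payload)} bytes")

send(3, 7, b"tile")     # same chip
send(3, 130, b"tile")   # different chip, identical call site
```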

"Here is a piece of the matrix and I need to get it to the other chip". Does this mean that the systolic nature of matrix mult. workloads are controlled via microcode?

**NeuroStream**

https://www.nextplatform.com/2017/02/02/memory-core-new-deep-learning-research-chip/

https://arxiv.org/pdf/1701.06420.pdf

NeuroStream is a processor-in-memory solution: "Smart Memory Cubes" combine an HMC and compute (the NeuroCluster) on a single die, with the compute taking up 8% of the die area of a standard HMC. NeuroCluster is a flexible many-core platform based on efficient RISC-V PEs (4 pipeline stages, in-order) and NeuroStream coprocessors (NSTs). There are no caches (except I$); scratchpads and DMA engines are used instead, plus a TLB and simple virtual memory. The NST is a SIMD accelerator that works directly off the scratchpad memory and achieves 1 MAC/cycle. 128 NST instances at 1 GHz sum to 256 GFLOPS (1 MAC = 2 FLOPs, so 128 × 2 × 1 GHz).
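
A sketch of that cache-less pattern, assuming invented buffer sizes and function names: DMA engines stream tiles from HMC DRAM into the scratchpad while the NST does MACs on the previously fetched tile (double buffering).

```python
import numpy as np

SCRATCHPAD_TILE = 1024  # elements per buffer; invented size

def dma_fetch(dram, offset, n):
    """Stands in for a DMA transfer, DRAM -> scratchpad."""
    return dram[offset:offset + n].copy()

def nst_dot(x, w):
    """The NST's job: 1 MAC/cycle, i.e. a dot product over the tile."""
    return float(np.dot(x, w))

def streaming_dot(x_dram, w_dram):
    acc = 0.0
    n = len(x_dram)
    buf = (dma_fetch(x_dram, 0, SCRATCHPAD_TILE),
           dma_fetch(w_dram, 0, SCRATCHPAD_TILE))
    for off in range(0, n, SCRATCHPAD_TILE):
        nxt = off + SCRATCHPAD_TILE
        if nxt < n:  # kick off the next DMA "in the background"
            pending = (dma_fetch(x_dram, nxt, SCRATCHPAD_TILE),
                       dma_fetch(w_dram, nxt, SCRATCHPAD_TILE))
        acc += nst_dot(*buf)          # compute overlaps the fetch
        if nxt < n:
            buf = pending
    return acc

x, w = np.random.rand(4096), np.random.rand(4096)
assert np.isclose(streaming_dot(x, w), np.dot(x, w))
```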

Info on the software stack is here. They have a gem5-based system simulator built around a Cortex-A15.

**Graphcore**

https://www.nextplatform.com/2017/03/09/early-look-startup-graphcores-deep-learning-chip/

At the core of the IPU is graph processing. Underlying machine learning workloads is a computational graph, with each vertex representing some compute function and the edges representing the data (with associated weights) flowing between vertices.
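
A minimal sketch of that graph view, purely illustrative and not Graphcore's actual API: vertices are compute functions, edges carry the data between them.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    fn: callable               # the compute this vertex performs
    inputs: list = field(default_factory=list)   # upstream vertices

def evaluate(v: Vertex, cache=None):
    """Pull-based evaluation of the computational graph."""
    cache = {} if cache is None else cache
    if v.name not in cache:
        args = [evaluate(u, cache) for u in v.inputs]
        cache[v.name] = v.fn(*args)
    return cache[v.name]

x   = Vertex("x",    lambda: 2.0)
w   = Vertex("w",    lambda: 3.0)          # a weight, modeled as a source
mul = Vertex("mul",  lambda a, b: a * b, [x, w])
out = Vertex("relu", lambda a: max(a, 0.0), [mul])
print(evaluate(out))   # 6.0
```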

The entire neural network model fits in the processor (not in external memory). "We are trying to map a graph to what is in effect a graph processor." The key is software that takes these complex structures and maps them onto a highly parallel processor with all the memory needed to hold the model inside the processor. Neural networks are expanded into a graph, and the software maps that graph onto a relatively simple processor whose interconnect is controlled entirely by the compiler.
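
A hedged sketch of that compile-time mapping step: assign each vertex to a tile, then enumerate the inter-tile transfers, which are then fully known before the program runs. The greedy round-robin placement below is just a stand-in for whatever the real compiler does.

```python
def place(vertices, edges, n_tiles):
    """vertices: list of names; edges: list of (src, dst) pairs."""
    tile_of = {v: i % n_tiles for i, v in enumerate(vertices)}  # naive placement
    # Every edge whose endpoints land on different tiles becomes a
    # statically known exchange the compiler can schedule up front.
    exchanges = [(s, d, tile_of[s], tile_of[d])
                 for (s, d) in edges if tile_of[s] != tile_of[d]]
    return tile_of, exchanges

verts = ["x", "w", "mul", "relu"]
edges = [("x", "mul"), ("w", "mul"), ("mul", "relu")]
placement, xfers = place(verts, edges, n_tiles=2)
print(placement)  # which tile holds each vertex (and its slice of the model)
print(xfers)      # compile-time-known inter-tile communication
```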

**Wave Computing**

https://www.nextplatform.com/2016/09/07/next-wave-deep-learning-architectures/

https://www.bdti.com/InsideDSP/2016/11/17/WaveComputing

Dataflow architecture (DPU): 16K independent processors (8-bit, RISC-oriented). At the core of the approach is the use of fixed-point arithmetic with stochastic rounding, plus many small compute elements for high parallelism.
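
A sketch of fixed-point quantization with stochastic rounding: round up with probability equal to the fractional distance, so the rounding error is zero in expectation. Wave's exact number format is not public; the 8 fractional bits here are an assumption.

```python
import numpy as np

def stochastic_round_fixed(x, frac_bits=8, rng=np.random.default_rng(0)):
    scaled = np.asarray(x, dtype=np.float64) * (1 << frac_bits)
    floor = np.floor(scaled)
    frac = scaled - floor                    # distance to the next step up
    rounded = floor + (rng.random(scaled.shape) < frac)
    return rounded / (1 << frac_bits)        # back to real-valued form

x = np.full(100_000, 0.12345)
print(stochastic_round_fixed(x).mean())  # ~0.12345: unbiased on average
```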