Saturday, September 16, 2017

Notes on Machine Learning Hardware

I previously summarized the original TPU design here.

Nervana Systems

https://www.nextplatform.com/2016/08/08/deep-learning-chip-upstart-set-take-gpus-task/

The company is still not willing to give details about the processing elements and interconnect.  There is no floating-point unit; instead a numeric format called "flexpoint" is used.  The interconnect extends to high-speed serial links, so from a programming perspective it looks the same to talk between chips or to different units on a single chip.  The architecture is completely non-coherent and there are no caches, only software-managed resources.   Message passing is used to communicate between chips.  This is unbelievably similar to the processor I helped architect and implement for REX which, in itself, has roots in Transputer concepts.

"Here is a piece of the matrix and I need to get it to the other chip". Does this mean that the systolic nature of matrix mult. workloads are controlled via microcode?


NeuroStream

https://www.nextplatform.com/2017/02/02/memory-core-new-deep-learning-research-chip/
https://arxiv.org/pdf/1701.06420.pdf

NeuroStream is a processor-in-memory solution. "Smart Memory Cubes" combine an HMC and compute (the NeuroCluster) on a single die; the compute takes only about 8% of the die area of a standard HMC.  NeuroCluster is a flexible many-core platform based on efficient RISC-V PEs (4 pipeline stages, in order) and NeuroStream co-processors (NSTs).  There are no caches (except I$); scratchpads and DMA engines are used instead, along with a TLB and simple virtual memory.  The NST is a SIMD accelerator that works directly off the scratchpad memory and achieves 1 MAC/cycle.  128 instances of the NST at 1 GHz sum to 256 GFLOPS.
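
As a quick sanity check of that last number (my own arithmetic, counting one MAC as two floating-point operations):

// Back-of-the-envelope check of the claimed 256 GFLOPS (my arithmetic, not from the paper's text).
object NstThroughput extends App {
  val nstCount     = 128     // NST co-processor instances
  val clockHz      = 1.0e9   // 1 GHz
  val macsPerCycle = 1       // each NST sustains 1 MAC/cycle
  val flopsPerMac  = 2       // a multiply-accumulate counts as 2 FLOPs
  val gflops = nstCount * clockHz * macsPerCycle * flopsPerMac / 1e9
  println(f"Peak throughput: $gflops%.0f GFLOPS")  // prints 256 GFLOPS
}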

Info on the software stack is here.  They have a gem5-based system simulator built on a Cortex-A15 model.

Graphcore

https://www.nextplatform.com/2017/03/09/early-look-startup-graphcores-deep-learning-chip/

At the core of the IPU is a graph-processing base.  Underlying machine learning workloads is a computational graph, with each vertex representing some compute function and the edges representing data with associated weights.

The entire neural network model fits in the processor (not in external memory).  "We are trying to map a graph to what is in effect a graph processor."  The key is software that takes these complex structures and maps them to a highly parallel processor with enough memory to hold the model inside the processor.  Neural networks are expanded into a graph, and the software maps that graph to a relatively simple processor with an interconnect that is controlled entirely by the compiler.

Wave Computing

https://www.nextplatform.com/2016/09/07/next-wave-deep-learning-architectures/
https://www.bdti.com/InsideDSP/2016/11/17/WaveComputing

Dataflow architecture (DPU).  16K independent processors (8-bit, RISC-oriented).  At the core of the approach is the use of fixed-point arithmetic with stochastic rounding techniques, plus many small compute elements for high parallelism.
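
As an aside, here is a minimal sketch of what stochastic rounding to a fixed-point grid looks like.  This is my own illustration in Scala, not Wave's implementation; the names and bit widths are made up.

import scala.util.Random

// Stochastic rounding: round up with probability equal to the fractional
// distance to the next grid point, so the quantization error is zero on average.
object StochasticRoundingSketch extends App {
  val rng = new Random(0)

  def stochasticRound(x: Double, fracBits: Int): Double = {
    val scale  = (1L << fracBits).toDouble
    val scaled = x * scale
    val lo     = math.floor(scaled)
    val frac   = scaled - lo                  // in [0, 1)
    val q      = if (rng.nextDouble() < frac) lo + 1 else lo
    q / scale
  }

  // With 2 fractional bits the grid step is 0.25; averaging many rounded samples
  // of 0.3 converges back to ~0.3, unlike round-to-nearest (which always gives 0.25).
  val avg = Seq.fill(100000)(stochasticRound(0.3, fracBits = 2)).sum / 100000
  println(avg)
}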

Friday, August 25, 2017

Decomposing TileLink2 in RocketChip: A Barebones System

Quick post.  I spent some time yesterday/today hacking at the TileLink2 implementation in RocketChip.

(As a matter of coincidence, SiFive recently released a draft spec of the protocol.)

My preferred method of learning is first to understand the engineer's original intent before enhancing or leveraging code.  This means taking things apart.

The most difficult aspect of getting something up was understanding the inner workings of the SystemBus class.  I kept hitting run-time assertions, and it was helpful to manually sketch out the connections between the TL objects:


Here is a summary of my findings.

I identified the minimal set of configuration knobs for this experiment.


// Barebones system configuration
class BarebonesSystemConfig extends Config((site, here, up) => {
  case XLen => 64                      // 64-bit data path
  case BankedL2Params => BankedL2Params(nMemoryChannels = 0, // no memory channels
    nBanksPerChannel = 1,
    coherenceManager = { // FIXME: not really a coherence manager
      case (q, _) =>
        implicit val p = q
        val cork = LazyModule(new TLCacheCork(unsafe = true))
        (cork.node, cork.node)
    })
  case SystemBusParams => SystemBusParams(beatBytes = site(XLen) / 8, blockBytes = site(CacheBlockBytes))
  case PeripheryBusParams => PeripheryBusParams(beatBytes = site(XLen) / 8, blockBytes = site(CacheBlockBytes))
  case MemoryBusParams => MemoryBusParams(beatBytes = 8, blockBytes = site(CacheBlockBytes))
  case CacheBlockBytes => 64           // 64-byte cache lines
  case DTSTimebase => BigInt(1000000)  // 1 MHz device-tree timebase
  case DTSModel => "bboneschip"
  case DTSCompat => Nil
  case TLMonitorBuilder => (args: TLMonitorArgs) => Some(LazyModule(new TLMonitor(args))) // keep TileLink protocol monitors
  case TLCombinationalCheck => false
  case TLBusDelayProbability => 0.0    // no random bus delays
})

This is the system itself.  It extends BaseCoreplex, not a Rocket-based coreplex.


class BarebonesSystem(implicit p: Parameters) extends BaseCoreplex {
  // connect single master device (fuzzer) to system bus (sbus)
  sbus.fromSyncPorts() :=* LazyModule(new TLFuzzer(1)).node

  // connect single slave device (memory) to the periphery bus (pbus)
  LazyModule(new TLTestRAM(AddressSet(0x0, 0xfff))).node :=* pbus.toFixedWidthSingleBeatSlave(4)

  override lazy val module = new BarebonesSystemModule(this)
}
class BarebonesSystemModule[T <: BarebonesSystem](_outer: T) extends BaseCoreplexModule(_outer) {}

This should be familiar.  It is your top level Chisel3 module.


// Chisel top module
class Top(implicit val p: Parameters) extends Module {
  val io = new Bundle {
    val success = Bool(OUTPUT)
  }
  val dut = Module(LazyModule(new BarebonesSystem).module)
  dut.reset := reset
}

Here are the important parts of the system generator.  This will generate your FIRRTL and Verilog output.


// System generator
class System extends HasGeneratorUtilities {
  val names = new ParsedInputNames(
    targetDir = ".",
    topModuleProject = "com.bbones",
    topModuleClass = "Top",
    configProject = "",
    configs = "system"
  )

  val config = new BarebonesSystemConfig
  val params = Parameters.root(config.toInstance)
  val circuit = elaborate(names, params)

  def generateArtefacts: Unit =
    ElaborationArtefacts.files.foreach {
      case (extension, contents) =>
        writeOutputFile(names.targetDir, s"${names.configs}.${extension}", contents())
    }

  def generateFirrtl: Unit =
    Driver.dumpFirrtl(circuit, Some(new File(names.targetDir, s"${names.configs}.fir")))

  def generateVerilog: Unit = {
    import sys.process._
    val firrtl = "/home/edc/gitrepos/rocket-chip/firrtl/utils/bin/firrtl"
    val path = s"${names.targetDir}/${names.configs}"
    val bin = s"$firrtl -i $path.fir -o $path.v -X verilog"
    bin.!
  }

  generateArtefacts
  generateFirrtl
  generateVerilog
}

Finally, one should never forget unit testing!


// Unit test
class SystemSpec extends FlatSpec with Matchers {
  "An instance of the system generator " should "compile" in {
    val s = new System
  }
}

Thursday, August 17, 2017

Becoming Functional: Book and Personal Experience

Becoming Functional is a book written by Joshua Backfield that lays out steps to transform oneself into a functional programmer.


It is a short book (under 150 pages) and is a very enjoyable read.  I did learn a thing or ten.  I refer back to the book whenever I have a suspicion that a certain aspect of my code can be improved.

The most interesting revelation is how my background in hardware design and verification engineering lends itself well to functional programming.  Functional concepts seem to come naturally to me.  I am no expert by any means, but looking at Scala code doesn't frighten me anymore.

Thursday, August 3, 2017

TPU Research Notes

Here is a collection of notes from my research on TPU architecture.

In-Datacenter Performance Analysis of a Tensor Processing Unit (ISCA 2017)


link

The first TPU was developed to accelerate the inference phase of neural networks (not training).  The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit and a large 28 MB software-managed on-chip memory.  An activation unit takes care of the hardwired activation functions (e.g., ReLU, sigmoid).  One benefit of the TPU architecture is its deterministic execution model, compared to the non-deterministic (time-varying) behavior of CPUs and GPUs.  It is 15X-30X faster than a contemporary GPU or CPU, with 30X-80X better TOPS/Watt.
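
A quick back-of-the-envelope check of the Matrix Unit's peak throughput (my own arithmetic; the 256x256 array size, 700 MHz clock, and ~92 TOPS peak are figures from the paper):

// Peak ops of the Matrix Unit: 256x256 = 65,536 MACs, 2 ops per MAC, 700 MHz clock.
object TpuPeakOps extends App {
  val macs      = 256 * 256   // 65,536 8-bit MACs
  val opsPerMac = 2           // multiply + add
  val clockHz   = 700.0e6     // 700 MHz
  val teraOps = macs * opsPerMac * clockHz / 1e12
  println(f"Peak: $teraOps%.1f TOPS")  // ~91.8 TOPS
}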

Introduction to NNs

Large data sets in the cloud and the availability of compute power enabled a renaissance in machine learning.  The use of deep neural networks led to breakthroughs in reducing error rates of speech recognition, image recognition, etc.

NNs imitate brain-like functionality and are based on the simple artificial neuron.  This "neuron" is a nonlinear function of a weighted sum of its inputs.   The "neurons" are connected in layers.  The "deep" part of DNN comes from going beyond a few layers, made possible by higher compute capability.
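
In code, a single artificial neuron is just the following.  This is a toy sketch of my own, with ReLU as the nonlinearity.

// One artificial neuron: a nonlinear function (here ReLU) of a weighted sum of its inputs.
object NeuronSketch extends App {
  def relu(x: Double): Double = math.max(0.0, x)

  def neuron(inputs: Seq[Double], weights: Seq[Double], bias: Double): Double =
    relu((inputs zip weights).map { case (x, w) => x * w }.sum + bias)

  // A layer is just many neurons sharing the same inputs; "deep" means many layers.
  println(neuron(Seq(1.0, 0.5, -2.0), Seq(0.3, -0.1, 0.8), bias = 0.2))  // 0.0 (ReLU clips the negative sum)
}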

The two phases of NN are called training (or learning) and inference (or prediction).  Today, all training is done in floating point using GPUs.  A step called quantization transforms the floating-point numbers into narrow integers (sometimes as narrow as 8 bits), which are good enough for inference.
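
Here is a minimal sketch of what such a quantization step could look like.  This is a generic symmetric linear scheme of my own choosing for illustration, not necessarily the TPU toolchain's actual method.

// Symmetric linear quantization of a trained tensor to signed 8-bit integers.
object QuantizeSketch extends App {
  def quantizeInt8(values: Seq[Float]): (Seq[Byte], Float) = {
    val maxAbs = values.map(v => math.abs(v)).max max 1e-8f
    val scale  = maxAbs / 127.0f
    val q = values.map(v => math.max(-127, math.min(127, math.round(v / scale))).toByte)
    (q, scale)                                // keep the scale to dequantize results
  }

  def dequantize(q: Seq[Byte], scale: Float): Seq[Float] = q.map(_ * scale)

  val (q, s) = quantizeInt8(Seq(0.1f, -1.2f, 0.75f))
  println(q.mkString(", ") + s"  scale=$s")   // -1.2f maps to -127, the rest scale accordingly
}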

Implementation

Designed to be a coprocessor on the PCIe bus.  The host sends TPU instructions for the coprocessor to execute.

TPU instructions are sent from the host over PCIe into an instruction buffer.  The main instructions include:

  • Read_Host_Memory:  Reads data from CPU host memory into the Unified Buffer.
  • Read_Weights:  Reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit.
  • MatrixMultiply/Convolve:  Performs a matrix multiply or convolution from the Unified Buffer into the Accumulators.
  • Activate: Performs the nonlinear function of the artificial neuron.  Its inputs are the Accumulators and its output is the Unified Buffer.
  • Write_Host_Memory: Writes data from the Unified Buffer into CPU host memory.
Data flows in through the left and weights are loaded from the top.  Execution of matrix multiplies occurs in a systolic manner (there is no fetch, execute, writeback sequence).
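
To make the systolic idea concrete, here is a tiny software model of a weight-stationary systolic array.  This is my own toy: the TPU's Matrix Unit is 256x256 while this is 3x3, and all the names are mine.  Activations enter from the left, the resident weights multiply them, and partial sums march downward one row per cycle.

// Toy cycle-by-cycle model of an N x N weight-stationary systolic array.
// PE(r)(c) holds weight w(r)(c); activations shift right, partial sums shift down.
object SystolicToy extends App {
  val N = 3
  val w = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9)) // resident weights
  val x = Array(Array(1, 0, 2), Array(0, 1, 1))                 // two input vectors
  val out = Array.ofDim[Int](x.length, N)                       // expect x * w

  var act  = Array.fill(N, N)(0)   // activation registers (move right)
  var psum = Array.fill(N, N)(0)   // partial-sum registers (move down)

  for (cycle <- 0 until x.length + 2 * N) {
    // Activations: row r receives x(m)(r) skewed by r cycles, otherwise shift right.
    val nextAct = Array.tabulate(N, N) { (r, c) =>
      if (c == 0) { val m = cycle - r; if (m >= 0 && m < x.length) x(m)(r) else 0 }
      else act(r)(c - 1)
    }
    // Each PE adds its MAC result to the partial sum arriving from above.
    val nextPsum = Array.tabulate(N, N) { (r, c) =>
      (if (r == 0) 0 else psum(r - 1)(c)) + nextAct(r)(c) * w(r)(c)
    }
    // The result for input m leaves the bottom of column c at cycle m + (N-1) + c.
    for (c <- 0 until N) {
      val m = cycle - (N - 1) - c
      if (m >= 0 && m < x.length) out(m)(c) = nextPsum(N - 1)(c)
    }
    act = nextAct; psum = nextPsum
  }

  out.foreach(row => println(row.mkString(" ")))  // prints 15 18 21 / 11 13 15
}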

(Interesting fact. 15% of papers at ISCA 2016 were on hardware accelerators for NN. 9 mentioned CNN, only 2 mentioned NN).

Summary

Results TL;DR: they are good.

Increasing memory bandwidth has the biggest performance impact: performance goes up by 3X when memory bandwidth increases 4X.  Using GDDR5 instead of DDR3 would have improved memory bandwidth by more than a factor of 5.

Clock rate has little benefit with or without more accumulators; a more aggressive design might have increased the clock rate by 50%.    Why? MLPs (multi-layer perceptrons) and LSTMs are memory bound, while only CNNs are compute bound.

Thursday, July 27, 2017

TensorFlow Ramp Up Notes

Quick post.  This was written some time ago but cleaned up this morning.  Similar in essence to a previous entry on neural nets. http://blog.edmondcote.com/2017/07/my-neural-network-ramp-up-notes.html

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems


http://download.tensorflow.org/paper/whitepaper2015.pdf

A TensorFlow computation is described by a directed graph that represents a dataflow computation, with extensions for maintaining/updating persistent state and for branching and looping control.  Each node has zero or more inputs and zero or more outputs and represents an instance of an operation.  Values that flow along normal edges are called tensors.  Special edges called control dependencies can also exist; no data flows on such edges.

An operation has a name and represents an abstract computation (e.g., matrix multiply or add).  An operation can have attributes that are provided at graph-construction time.  A kernel is an implementation of an operation that can be run on a particular type of device.
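
Here is a strawman of those concepts as plain data structures.  This is my own sketch in Scala, not TensorFlow's actual API (which is C++/Python).

// Dataflow graph: nodes are operation instances, tensors flow along normal
// edges, control dependencies only impose ordering.
object DataflowGraphSketch {
  case class Node(
    name: String,
    op: String,                             // abstract computation, e.g. "MatMul", "Add"
    attrs: Map[String, String] = Map.empty, // fixed at graph-construction time
    inputs: Seq[String] = Nil,              // producers of this node's input tensors
    controlDeps: Seq[String] = Nil          // ordering-only edges; no data flows here
  )
  case class Graph(nodes: Seq[Node])

  // A kernel would be a device-specific implementation of an op, e.g. a
  // cuBLAS-backed "MatMul" for GPU and a BLAS-backed one for CPU.
  val example = Graph(Seq(
    Node("x", op = "Placeholder"),
    Node("w", op = "Variable"),
    Node("y", op = "MatMul", inputs = Seq("x", "w")),
    Node("init", op = "NoOp"),
    Node("y_ready", op = "Identity", inputs = Seq("y"), controlDeps = Seq("init"))
  ))
}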

The main components of a TensorFlow system are the client, which communicates with the master, and one or more worker processes.  Each worker process is responsible for arbitrating access to one or more devices (GPU, CPU, etc.) and executes a subgraph.  Communication between nodes is achieved using send/receive primitives (RDMA, TCP).

Careful scheduling of TF operations can result in better system performance, specifically with respect to data transfers and memory usage (on a GPU, memory is scarce).

Pre-existing highly-optimized numerical libraries (BLAS, cuBLAS, etc.) are used to implement kernels for some operations.

Some ML algorithms, including those typically used for training neural networks, are tolerant of noise and reduced precision arithmetic.