Friday, August 25, 2017

Decomposing TileLink2 in RocketChip: A Barebones System

Quick post.  I spent some time yesterday/today hacking at the TileLink2 implementation in RocketChip.

(As a matter of coincidence, SiFive recently released a draft spec of the protocol.)

My preferred method of learning is first to understand the engineer's original intent before enhancing or leveraging code.  This means taking things apart.

The most difficult part of getting something up and running was understanding the inner workings of the SystemBus class.  I kept hitting runtime assertions, and it helped to manually sketch out the connections between the TL objects.

Here is a summary of my findings.

I identified the minimal set of configuration knobs for this experiment.

// Barebones system configuration
class BarebonesSystemConfig extends Config((site, here, up) => {
  case XLen => 64
  case BankedL2Params => BankedL2Params(nMemoryChannels = 0,
    nBanksPerChannel = 1,
    coherenceManager = { // FIXME: not really a coherence manager
      case (q, _) =>
        implicit val p = q
        val cork = LazyModule(new TLCacheCork(unsafe = true))
        (cork.node, cork.node)
    })
  case SystemBusParams => SystemBusParams(beatBytes = site(XLen) / 8, blockBytes = site(CacheBlockBytes))
  case PeripheryBusParams => PeripheryBusParams(beatBytes = site(XLen) / 8, blockBytes = site(CacheBlockBytes))
  case MemoryBusParams => MemoryBusParams(beatBytes = 8, blockBytes = site(CacheBlockBytes))
  case CacheBlockBytes => 64
  case DTSTimebase => BigInt(1000000) // 1 MHz
  case DTSModel => "bboneschip"
  case DTSCompat => Nil
  case TLMonitorBuilder => (args: TLMonitorArgs) => Some(LazyModule(new TLMonitor(args)))
  case TLCombinationalCheck => false
  case TLBusDelayProbability => 0.0
})
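
The `site(XLen) / 8` entries above derive one knob from another at lookup time.  As a rough mental model only, here is a self-contained toy version of that idea.  It is not the rocket-chip Config API; `ToyConfig`, `BeatBytes`, and the key objects are hypothetical names made up for this sketch.

```scala
// Toy model of key-based configuration lookup. All names are hypothetical;
// the real rocket-chip Config/Parameters classes are considerably richer.
object ToyConfigSketch {
  sealed trait Key
  case object XLen extends Key
  case object CacheBlockBytes extends Key
  case object BeatBytes extends Key

  // A config maps keys to values and can consult itself ("site") for other keys.
  final class ToyConfig(f: ToyConfig => PartialFunction[Key, Int]) {
    def apply(key: Key): Int = f(this)(key)
  }

  val barebones = new ToyConfig(site => {
    case XLen            => 64
    case CacheBlockBytes => 64
    case BeatBytes       => site(XLen) / 8 // derived: 64 bits / 8 = 8 bytes per beat
  })
}
```

Looking up `ToyConfigSketch.barebones(ToyConfigSketch.BeatBytes)` resolves `XLen` first and then divides, which is the same shape as the bus parameters in the real config.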

This is the system itself.  It extends BaseCoreplex, not RocketCoreplex.

class BarebonesSystem(implicit p: Parameters) extends BaseCoreplex {
  // connect single master device (fuzzer) to system bus (sbus)
  sbus.fromSyncPorts() :=* LazyModule(new TLFuzzer(1)).node

  // connect single slave device (memory) to the periphery bus (pbus)
  LazyModule(new TLTestRAM(AddressSet(0x0, 0xfff))).node :=* pbus.toFixedWidthSingleBeatSlave(4)

  override lazy val module = new BarebonesSystemModule(this)
}

class BarebonesSystemModule[T <: BarebonesSystem](_outer: T) extends BaseCoreplexModule(_outer) {}
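
The test RAM above is placed at `AddressSet(0x0, 0xfff)`, a base/mask pair.  The sketch below is a simplified stand-in for how such a pair can be interpreted (an address matches when it agrees with the base on every bit outside the mask).  `ToyAddressSet` is a hypothetical class written for illustration, not the rocket-chip one.

```scala
// Simplified model of a base/mask address set. ToyAddressSet is a
// hypothetical stand-in, not the rocket-chip AddressSet class.
object AddressSetSketch {
  final case class ToyAddressSet(base: BigInt, mask: BigInt) {
    // An address matches when it agrees with base on every bit outside mask.
    def contains(addr: BigInt): Boolean = ((addr ^ base) & ~mask) == BigInt(0)
    // For a contiguous low-bit mask, the region spans mask + 1 bytes.
    def sizeBytes: BigInt = mask + 1
  }

  // Same base and mask as the test RAM in the system above.
  val ram = ToyAddressSet(0x0, 0xfff)
}
```

Under this reading the RAM covers 4 KiB: addresses 0x0 through 0xfff match, and 0x1000 does not.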

This should be familiar.  It is your top level Chisel3 module.

// Chisel top module
class Top(implicit val p: Parameters) extends Module {
  val io = new Bundle {
    val success = Bool(OUTPUT)
  }
  val dut = Module(LazyModule(new BarebonesSystem).module)
  dut.reset := reset
}

Here are the important parts of the system generator.  This generates the FIRRTL and Verilog output.

// System generator
import java.io.File // needed for new File(...) in generateFirrtl

class System extends HasGeneratorUtilities {
  val names = new ParsedInputNames(
    targetDir = ".",
    topModuleProject = "com.bbones",
    topModuleClass = "Top",
    configProject = "",
    configs = "system"
  )

  val config = new BarebonesSystemConfig
  val params = Parameters.root(config.toInstance)
  val circuit = elaborate(names, params)

  // Write out every elaboration artefact as <configs>.<extension>
  def generateArtefacts: Unit =
    ElaborationArtefacts.files.foreach {
      case (extension, contents) =>
        writeOutputFile(names.targetDir, s"${names.configs}.${extension}", contents())
    }

  def generateFirrtl: Unit =
    Driver.dumpFirrtl(circuit, Some(new File(names.targetDir, s"${names.configs}.fir")))

  def generateVerilog: Unit = {
    import sys.process._
    val firrtl = "/home/edc/gitrepos/rocket-chip/firrtl/utils/bin/firrtl"
    val path = s"${names.targetDir}/${names.configs}"
    val bin = s"$firrtl -i $path.fir -o $path.v -X verilog"
    bin.! // run the command; assumed from the sys.process import, the original call is not shown
  }
}
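
The Verilog step above shells out to the firrtl script through `sys.process`.  A slightly more defensive way to do the same thing is sketched below: pass the arguments as a `Seq` (so paths with spaces survive) and check the exit code.  `FirrtlRunSketch`, `firrtlArgs`, and `run` are hypothetical helper names, and the flags are simply the ones used above.

```scala
import scala.sys.process._

// Hypothetical helpers; only the -i / -o / -X verilog flags come from the post.
object FirrtlRunSketch {
  // Build the argument list instead of a single space-separated string.
  def firrtlArgs(firrtl: String, path: String): Seq[String] =
    Seq(firrtl, "-i", s"$path.fir", "-o", s"$path.v", "-X", "verilog")

  // Run the command and fail loudly on a non-zero exit code.
  def run(args: Seq[String]): Unit = {
    val exitCode = args.!
    require(exitCode == 0, s"command failed with exit code $exitCode: ${args.mkString(" ")}")
  }
}
```

Usage would look like `FirrtlRunSketch.run(FirrtlRunSketch.firrtlArgs(firrtl, path))`, with `firrtl` and `path` as defined in `generateVerilog`.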


Finally, one should never forget unit testing!

// Unit test
class SystemSpec extends FlatSpec with Matchers {
  "An instance of the system generator " should "compile" in {
    val s = new System
  }
}

Thursday, August 17, 2017

Becoming Functional: Book and Personal Experience

Becoming Functional is a book written by Joshua Backfield that lays out steps to transform oneself into a functional programmer.

It is a short book (under 150 pages) and is a very enjoyable read.  I did learn a thing or ten.  I refer back to the book whenever I have a suspicion that a certain aspect of my code can be improved.

The most interesting revelation is how well my background in hardware design and verification engineering lends itself to functional programming.  Functional concepts seem to come naturally to me.  I am no expert by any means, but looking at Scala code doesn't frighten me anymore.

Thursday, August 3, 2017

TPU Research Notes

Here is a collection of notes from my research on TPU architecture.

In-Datacenter Performance Analysis of a Tensor Processing Unit (ISCA 2017)


The first TPU was developed to accelerate the inference phase of neural networks (not training).  The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit and a large 28 MB software-managed on-chip memory.  An activation unit handles the nonlinear activation functions in hardware (ReLU, sigmoid, and so on).  One benefit of the TPU architecture is its deterministic execution model, in contrast to the time-varying behavior of CPUs and GPUs.  The paper reports the TPU as 15X-30X faster than its contemporary GPU or CPU, with 30X-80X better TOPS/Watt.
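
As a back-of-the-envelope check on the size of the matrix unit, the sketch below multiplies out the MAC count.  The 256x256 array organization and the 700 MHz clock are taken from the paper rather than from these notes, and counting each MAC as two operations (one multiply, one add) is the usual convention assumed here.

```scala
// Peak-throughput arithmetic for the matrix multiply unit.
// 256x256 and 700 MHz are figures from the paper, not from these notes.
object TpuPeakSketch {
  val macs: Long = 256L * 256L      // 65,536 MAC units
  val clockHz: Long = 700000000L    // 700 MHz
  val opsPerMac: Long = 2L          // one multiply plus one add per cycle
  val peakOpsPerSec: Long = macs * opsPerMac * clockHz
}
```

This works out to roughly 92 tera-operations per second, which matches the peak figure the paper quotes.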

Introduction to NNs

Large data sets in the cloud and the availability of compute power enabled a renaissance in machine learning.  The use of deep neural networks led to breakthroughs in reducing the error rates of speech recognition, image recognition, etc.

NNs imitate brain-like functionality and are based on a simple artificial neuron.  This "neuron" is a nonlinear function of a weighted sum of its inputs.  The "neurons" are connected in layers.  The "deep" part of DNN comes from going beyond a few layers, which was made possible by higher compute capability.
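
A minimal sketch of the "neuron" described above, assuming ReLU as the nonlinear function and an optional bias term (both choices are mine, for illustration).  A layer is just the same inputs fed to several neurons, each with its own row of weights.

```scala
// One artificial neuron: a nonlinear function of a weighted sum of inputs.
// ReLU and the bias parameter are illustrative assumptions.
object NeuronSketch {
  def relu(x: Double): Double = math.max(0.0, x)

  def neuron(inputs: Seq[Double], weights: Seq[Double], bias: Double = 0.0): Double = {
    require(inputs.length == weights.length, "one weight per input")
    relu(inputs.zip(weights).map { case (x, w) => x * w }.sum + bias)
  }

  // A layer applies several neurons (one weight row each) to the same inputs.
  def layer(inputs: Seq[Double], weightRows: Seq[Seq[Double]]): Seq[Double] =
    weightRows.map(row => neuron(inputs, row))
}
```

Stacking calls to `layer`, each consuming the previous output, gives the multi-layer structure that the "deep" in DNN refers to.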

The two phases of a NN are called training (or learning) and inference (or prediction).  Today, virtually all training is done in floating point, typically on GPUs.  A step called quantization transforms floating-point numbers into narrow integers (sometimes just 8 bits), which are good enough for inference.
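
To make the quantization step concrete, here is a minimal sketch of one possible scheme: symmetric linear quantization to signed 8-bit values with a single scale per tensor.  The notes do not say which scheme is used in practice, so treat the scheme and the helper names as assumptions.

```scala
// Symmetric linear quantization to 8 bits. This is one common scheme,
// assumed here for illustration; it is not taken from the paper.
object QuantizeSketch {
  // Pick a scale so that the largest magnitude maps to 127.
  def scaleFor(values: Seq[Float]): Float = {
    val maxAbs = values.foldLeft(0f)((m, v) => math.max(m, math.abs(v)))
    if (maxAbs == 0f) 1f else maxAbs / 127f
  }

  // Round to the nearest step and clamp into the signed 8-bit range.
  def quantize(v: Float, scale: Float): Byte = {
    val q = math.round(v / scale)
    math.max(-127, math.min(127, q)).toByte
  }

  // Recover an approximate floating-point value.
  def dequantize(q: Byte, scale: Float): Float = q * scale
}
```

Values are recovered only approximately after `dequantize`; the point of the step is that this loss of precision is tolerable for inference.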


The TPU is designed to be a coprocessor on the PCIe bus.  The host sends TPU instructions for the coprocessor to execute.

TPU instructions are sent from the host over PCIe into an instruction buffer.  The main instructions include:

  • Read_Host_Memory:  Reads data from CPU host memory into the Unified Buffer.
  • Read_Weights:  Reads weights from weight memory into the Weight FIFO as input to the Matrix Unit.
  • MatrixMultiply/Convolve:  Performs a matrix multiply or convolution from the Unified Buffer into the Accumulators.
  • Activate:  Performs the nonlinear function of the artificial neuron.  Its inputs are the Accumulators and its output is the Unified Buffer.
  • Write_Host_Memory:  Writes data from the Unified Buffer into CPU host memory.

Data flows in through the left and weights are loaded from the top.  Execution of matrix multiplies occurs in a systolic manner (no fetch, execute, writeback sequence).
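
The MatrixMultiply and Activate steps can be sketched as a purely functional model.  This captures only what is computed (8-bit operands, wider accumulators, then a nonlinearity), not the systolic timing; using `Int` for the accumulators and ReLU for the activation are assumptions for illustration.

```scala
// Functional model of MatrixMultiply followed by Activate.
// It models the results only, not the cycle-by-cycle systolic dataflow.
object TpuDataflowSketch {
  type Accumulators = Vector[Vector[Int]]

  // 8-bit activations times 8-bit weights, summed into wider accumulators.
  def matrixMultiply(activations: Vector[Vector[Byte]],
                     weights: Vector[Vector[Byte]]): Accumulators = {
    val cols = weights.transpose
    activations.map { row =>
      cols.map(col => row.zip(col).map { case (a, w) => a.toInt * w.toInt }.sum)
    }
  }

  // Apply the nonlinearity (ReLU here) to the accumulator contents.
  def activate(acc: Accumulators): Accumulators =
    acc.map(_.map(v => math.max(0, v)))
}
```

In the instruction flow above, the input to `matrixMultiply` would come from the Unified Buffer and Weight FIFO, and the output of `activate` would go back to the Unified Buffer.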

(Interesting fact: 15% of the papers at ISCA 2016 were on hardware accelerators for NNs.  All 9 looked at CNNs, and only 2 mentioned other NNs.)


Results TL;DR: they are good.

Increasing memory bandwidth has the biggest performance impact: performance goes up by 3X on average when memory bandwidth increases 4X.  Using GDDR5 instead of DDR3 would have improved memory bandwidth by more than a factor of 5.

Clock rate has little benefit, with or without more accumulators.  Why?  MLPs (multi-layer perceptrons) and LSTMs are memory bound; only CNNs are compute bound.  A more aggressive design might have increased the clock rate by 50%.
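
The memory-bound versus compute-bound distinction can be illustrated with a simple roofline-style bound: attainable throughput is the smaller of the compute peak and memory bandwidth times operational intensity.  The sketch below is a generic model with made-up round numbers in the usage note; it is not the paper's data.

```scala
// Roofline-style upper bound on throughput. Generic model; any numbers
// plugged into it here are hypothetical, not measurements from the paper.
object RooflineSketch {
  // Throughput is capped by compute or by memory traffic, whichever is lower.
  def attainable(peakOpsPerSec: Double, bytesPerSec: Double, opsPerByte: Double): Double =
    math.min(peakOpsPerSec, bytesPerSec * opsPerByte)

  // A workload is memory bound when the bandwidth term is the limiting one.
  def isMemoryBound(peakOpsPerSec: Double, bytesPerSec: Double, opsPerByte: Double): Boolean =
    bytesPerSec * opsPerByte < peakOpsPerSec
}
```

With a hypothetical peak of 100, bandwidth of 1, and intensity of 10, the bound is 10.  Quadrupling bandwidth raises it to 40, while raising the peak by 50% leaves it at 10, which is the pattern described above for memory-bound workloads.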