Wednesday, September 27, 2017

Reflecting on my birthday

Reflecting on my 36th birthday this morning.  I feel incredibly grateful in both my personal and professional life.

Focusing on the latter (this is a professional blog, after all), the news that Sun/Oracle's SPARC program was shut down brought up some thoughts.  I started my career on that team.  They were doing interesting work.  The challenges were both technical and non-technical.  Getting things done required less coding and more engineering: the aspects of the profession that are not taught in school (e.g. sourcing requirements, gathering data, engaging with partners, and establishing consensus).  I recall the exit interview with my then director.  What stuck with me from the conversation was the pride the director took in having assembled a team family, many of whom had already been working together for 10+ years.  Maybe it was a retention tactic (ok, of course it was!), but I was extended a very genuine statement that I was valued in the team family and a promise that, within this environment, I would continue to grow, etc.  Obviously, I didn't recognize any of this at the time.  Many of my ex-colleagues did remain on that team and, post layoff, I think back to the growth potential I would have had in that organization.  I've since been happy to refer a handful of these people to my connections.

My life (including personal life) may have been very different if I had continued on that path.  Alas, I made a different decision.  For a short while, I chased happiness at work, but over the past 5+ years, my focus changed to becoming my best self and doing my best work.  This has never been more true than today.  I am grateful that this endeavor is the third most important area in my life (following self-care and my wife/family).

Tuesday, September 26, 2017

Generate custom Chisel bundle using metaprogramming

I found the example for extending a Bundle's functionality limiting.  You need to pass (String, Data) tuples as varargs to the constructor.  Here's a snippet:

  import chisel3._
  import scala.collection.immutable.ListMap

  final class CustomBundle(elts: (String, Data)*) extends Record {
    val elements = ListMap(elts map { case (field, elt) => field -> elt.chiselCloneType }: _*)
    def apply(elt: String): Data = elements(elt)
    override def cloneType = (new CustomBundle(elements.toList: _*)).asInstanceOf[this.type]
  }

I could have found a solution using the above (not sure, is it possible to convert ListMap to varargs?), but I dug into Scala metaprogramming for the purpose of ramping up.  Behold!  Here's an alternative (though incomplete) method of generating a bundle.

object MetaBundle {
  import reflect.runtime.currentMirror
  import tools.reflect.ToolBox
  val toolbox = currentMirror.mkToolBox()
  import toolbox.u._

  def apply(): Bundle = {
    val tree =
      q"""
          import chisel3.core._
          val b = new Bundle {
            // placeholder for your interface
            val i = Input(UInt(1.W))
            val o = Output(UInt(1.W))
          }
          b
       """    
    toolbox.eval(tree).asInstanceOf[Bundle]
  }
}

The code above returns an instance of a "typeless" bundle: it is statically typed as a plain Bundle, so the compiler knows nothing about its fields.  The result isn't equivalent to the CustomBundle above, but it serves my purpose.  I have yet to try, but I suspect the technique can be extended to other cases.  There are likely implications with respect to the statically typed nature of Scala.  Use at your own risk.  I am only posting this because it's a neat hack.
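
On the varargs question raised above: the `elements.toList: _*` call in CustomBundle's cloneType already performs that conversion.  Here is a minimal plain-Scala sketch of the same round trip, with no Chisel dependency.  `VarargsDemo` and `Fields` are hypothetical stand-ins for illustration, not part of any library.

```scala
import scala.collection.immutable.ListMap

object VarargsDemo {
  // Stand-in for CustomBundle: takes (name, value) pairs as varargs
  // and keeps them in insertion order.
  final class Fields(elts: (String, Int)*) {
    val elements: ListMap[String, Int] = ListMap(elts: _*)

    // toList turns the map back into a List[(String, Int)];
    // `: _*` then expands that list into varargs.
    def copy: Fields = new Fields(elements.toList: _*)
  }
}
```

So a ListMap can be fed back into a varargs parameter by going through `toList` (or `toSeq`) first.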

Monday, September 25, 2017

Driving random values on all elements of a bundle

Quick addition to the previous post.  This technique is useful when prototyping hardware.  Use LFSRs to drive pins/ports so that no logic pruning occurs in their cone of logic during synthesis.

  final def prng(bundle: Bundle): Unit = {
    bundle.elements.filter(_._2.dir == OUTPUT).foreach {
      case (n, d) => {
        val r = Module(new LFSR(d.getWidth))
        when(true.B) {
          d := r.io.y
        }
      }
    }
  }
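
The LFSR module used above isn't shown in this post.  As a rough software model of what such a generator does, here is a sketch of one common configuration: a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11.  The width, the taps and the names `LfsrModel`, `step` and `period` are assumptions for illustration; the real module may differ.

```scala
object LfsrModel {
  // One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11).
  // The XOR of the tapped bits is shifted in at the top.
  def step(state: Int): Int = {
    val bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    ((state >> 1) | (bit << 15)) & 0xFFFF
  }

  // Number of steps until the seed recurs. The all-zero state is
  // excluded because the register would stay stuck at zero.
  def period(seed: Int): Int = {
    require(seed != 0 && (seed & 0xFFFF) == seed, "seed must be a non-zero 16-bit value")
    var s = step(seed)
    var n = 1
    while (s != seed) { s = step(s); n += 1 }
    n
  }
}
```

With these taps the register walks through every non-zero 16-bit value before repeating, which is why an LFSR output looks busy enough to keep synthesis from optimizing away the logic it drives.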

Example function to enhance Chisel bundle connection

Here's a real-world example that demonstrates the flexibility of Chisel (or DSLs in general) compared to traditional languages for describing hardware.

It is my opinion that SystemVerilog's promise of making module connection more user (and verification) friendly has not been realized.  The interface construct was a good first step, so were interface modports.  The .* connection operator or parameterizable interfaces, not so much.  Traditional teams send their best (aka. quickest) Perl monkey to the rescue and, just like that, a new target for feature creep and "I'm busy, fix the script yourself" is created.

You have access to an API when using an internal DSL approach (such as Chisel).  If you want to write your own function to connect arbitrary hardware interfaces, you do it.  This doesn't directly eliminate feature creep, but it provides a better framework for development than Perl/RegEx can offer.  Consider this tradeoff: how many lines of parsing, data structure definition, testing and software regression strategies are needed, compared to a few lines of Scala?

Here's an example from my private repo.  It takes two Bundles as input and connects fields with matching names.

final def connect(left: Bundle, right: Bundle)(implicit sourceInfo: SourceInfo, connectionCompileOptions: CompileOptions): Unit = {
  (left.elements.toSeq ++ right.elements.toSeq) // Map to Seq, then combine lhs and rhs
    .groupBy(_._1) // group by field name
    .filter(_._2.length == 2) // filter out non matches
    .foreach {
      // each remaining group holds the left element first, then the right one
      case (_, Seq((_, lhs), (_, rhs))) => {
        (lhs.dir, rhs.dir) match {
          case (NODIR, NODIR) => attach(lhs.asInstanceOf[Analog], rhs.asInstanceOf[Analog]) // to support INOUT ports
          case (INPUT, OUTPUT) => rhs := lhs
          case (OUTPUT, INPUT) => lhs := rhs
          case (INPUT, INPUT) => rhs := lhs // TODO: verify more
          case (OUTPUT, OUTPUT) => lhs := rhs // TODO: verify more
        }
      }
    }
}

PS. Slowly, but surely, I'm becoming functional.


It's difficult for the time being, especially without an IDE.  IntelliJ IDEA is terrific.  Still can't properly explain a Monad.  The ability to highlight a val, hit "Alt-Enter", and select "Add Type annotation to value definition" provides immeasurable value.  Here's an example.  By breaking up different parts of the above function, I can better determine (before compile time) the type of data structure that will be output.


    val a: Map[String, Data] = left.elements
    val b: Map[String, Data] = right.elements
    val c: Seq[(String, Data)] = a.toSeq
    val d: Seq[(String, Data)] = a.toSeq ++ b.toSeq
    val e: Map[String, Seq[(String, Data)]] = d.groupBy(_._1)
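
The name-matching pipeline above can also be tried out without Chisel.  Below is a small sketch that applies the same toSeq / groupBy steps to ordinary Maps and returns the pairs that share a key.  `MatchByName` and `matching` are hypothetical names used only for this illustration.

```scala
object MatchByName {
  // Pair up the values of two maps that share a key.
  // groupBy keeps encounter order inside each group, so the value
  // from `left` always comes before the value from `right`.
  def matching[A](left: Map[String, A], right: Map[String, A]): Map[String, (A, A)] =
    (left.toSeq ++ right.toSeq)
      .groupBy(_._1)
      .collect { case (name, Seq((_, l), (_, r))) => (name, (l, r)) }
}
```

Keys that appear on only one side form a group of length one, fail the two-element pattern, and are dropped, which mirrors the `length == 2` filter in the connect function.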

Saturday, September 16, 2017

Notes on Machine Learning Hardware

I previously summarized the original TPU design here.

Nervana Systems

https://www.nextplatform.com/2016/08/08/deep-learning-chip-upstart-set-take-gpus-task/

The company is still not willing to give details about the processing elements and interconnect.  There is no floating point element ("flexpoint").  The interconnect extends to high speed serial links.  From a programming perspective, it looks the same to talk between chips or to different units on a single chip.  The architecture is completely non-coherent and there are no caches, only software managed resources.  Message passing is used to communicate between chips.  This is unbelievably similar to the processor I helped architect and implement for REX which, in itself, has roots in Transputer concepts.

"Here is a piece of the matrix and I need to get it to the other chip". Does this mean that the systolic nature of matrix mult. workloads is controlled via microcode?


Neurostream

https://www.nextplatform.com/2017/02/02/memory-core-new-deep-learning-research-chip/
https://arxiv.org/pdf/1701.06420.pdf

Neurostream is a processor-in-memory solution.  "Smart Memory Cubes" combine HMC and compute on a single die (NeuroCluster).  Compute takes up 8% of the die area of a standard HMC.  NeuroCluster is a flexible many-core platform based on efficient RISC-V PEs (4 pipe stages, in order) and NeuroStream co-processors (NST).  No caches (except I$), but scratchpad and DMA engines are used.  TLB and simple virtual memory.  NST is a SIMD accelerator that works directly off the scratchpad memory.  Achieved 1 MAC/cycle.  128 instances of NST at 1 GHz sum to 256 GFLOPS.
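
The 256 GFLOPS figure can be reproduced from the other numbers in the paragraph, provided one MAC is counted as two floating point operations (a multiply and an add).  That counting convention is an assumption here, as are the names `PeakThroughput` and `gflops`.

```scala
object PeakThroughput {
  // Assumption: one multiply-accumulate counts as 2 FLOPs.
  val flopsPerMac = 2

  // Peak throughput in GFLOPS for `units` accelerators, each
  // completing `macsPerCycle` MACs per cycle at `clockGHz`.
  def gflops(units: Int, macsPerCycle: Int, clockGHz: Double): Double =
    units * macsPerCycle * flopsPerMac * clockGHz
}
```

Plugging in 128 units, 1 MAC/cycle and 1 GHz gives 128 x 1 x 2 x 1 = 256 GFLOPS, matching the quoted peak.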

Info on software stack here.  They have a gem5 based system simulator built on Cortex-A15.

Graphcore

https://www.nextplatform.com/2017/03/09/early-look-startup-graphcores-deep-learning-chip/

At the core of the IPU is a graph processing base.  Underlying machine learning workloads is a computational graph, with vertices representing compute functions applied to a set of edges, which represent data with associated weights.

The entire neural network model fits in the processor (not in the memory).  "We are trying to map a graph to what is in effect a graph processor".  The key is in software that allows us to take these complex structures and map them to a highly parallel processor with all the memory we need to hold the model inside the processor.  Neural networks are expanded into a graph.  The software maps the graph to a relatively simple processor with an interconnect that is controlled entirely by the compiler.

Wave Computing

https://www.nextplatform.com/2016/09/07/next-wave-deep-learning-architectures/
https://www.bdti.com/InsideDSP/2016/11/17/WaveComputing

Dataflow architecture (DPU).  16k independent processors (8 bit RISC-oriented).  At the core of the approach is the use of fixed point with stochastic rounding techniques and many small compute elements for high parallelism.
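
For readers unfamiliar with stochastic rounding, here is a generic sketch of the idea, not Wave Computing's implementation (which the linked articles don't detail).  A value is rounded up to the next fixed point step with probability equal to its fractional remainder, so the rounding error averages out to zero.  The names `StochasticRounding` and `round`, and passing the random draw `u` in as a parameter, are choices made for this illustration.

```scala
object StochasticRounding {
  // Round x to a multiple of 2^-fracBits.
  // `u` is a uniform random draw in [0, 1), supplied by the caller
  // so the function stays deterministic and easy to test.
  def round(x: Double, fracBits: Int, u: Double): Double = {
    val scale  = math.pow(2.0, fracBits)
    val scaled = x * scale
    val lower  = math.floor(scaled)
    val frac   = scaled - lower // distance above the lower grid point, in [0, 1)
    // Round up with probability `frac`, down otherwise.
    val q = if (u < frac) lower + 1 else lower
    q / scale
  }
}
```

With 2 fractional bits, 0.3 sits between 0.25 and 0.5 and is rounded up roughly 20% of the time; values already on the grid are returned unchanged.  In practice `u` would come from a random source such as `scala.util.Random.nextDouble()`.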