Edmond Cote's Blog: Lifelong learning


Monday, September 30, 2024

"Fundamentals of Digital Image and Video Processing" - A Course Review

I finally completed the "Fundamentals of Digital Image and Video Processing" course offered by Northwestern University on Coursera. Despite having a background in signal processing and experience with image/video processing, I found the course to be quite underwhelming. In truth, I skimmed through a good 30-40% of the material and quizzes.

The course material was dense and often felt like it glossed over important concepts. The instructor's style, with slides packed with information, didn't resonate with me. It seemed a better fit for those already familiar with the subject matter rather than your typical e-learner. Maybe if I were still a student and had fellow students to commiserate through the pain?

The quizzes were decent, with questions appropriately vague to force you to think rather than look up answers, but the programming assignments were underwhelming. Look elsewhere if you want to learn about digital image and video processing.

Thursday, May 10, 2018

Another Scala Cheatsheet

I recently wrote up a "cheat sheet" for Scala.

You can download a .pdf copy of the file here.

My office is coming together nicely thanks to "decorations" as such.

References

https://alvinalexander.com/downloads/scala/Scala-Cheat-Sheet-devdaily.pdf (75% “copied” from his material)
https://alvinalexander.com/scala/how-to-define-use-partial-functions-in-scala-syntax-examples
http://blog.originate.com/blog/2014/06/15/idiomatic-scala-your-options-do-not-match/ (see: The Ultimate Scala Option Cheat Sheet)

Saturday, September 16, 2017

Notes on Machine Learning Hardware

I previously summarized the original TPU design here.

Nervana Systems

https://www.nextplatform.com/2016/08/08/deep-learning-chip-upstart-set-take-gpus-task/

The company is still not willing to give details about the processing elements and interconnect. There is no floating-point element (the chip uses "flexpoint" instead). The interconnect extends to high speed serial links. From a programming perspective, it looks the same to talk between chips or to different units on a single chip. The architecture is completely non-coherent and there are no caches, only software-managed resources. Message passing is used to communicate between chips. This is unbelievably similar to the processor I helped architect and implement for REX which, in itself, has roots in Transputer concepts.

"Here is a piece of the matrix and I need to get it to the other chip". Does this mean that the systolic nature of matrix mult. workloads is controlled via microcode?

Neurostream

https://www.nextplatform.com/2017/02/02/memory-core-new-deep-learning-research-chip/
https://arxiv.org/pdf/1701.06420.pdf

Neurostream is a processor-in-memory solution. "Smart Memory Cubes" combine HMC and compute on a single die (NeuroCluster). Compute takes up 8% of the die area of a standard HMC. NeuroCluster is a flexible many-core platform based on efficient RISC-V PEs (4 pipe stages, in order) and NeuroStream co-processors (NST). There are no caches (except I$); scratchpads and DMA engines are used instead. There is a TLB and simple virtual memory. NST is a SIMD accelerator that works directly off the scratchpad memory and achieves 1 MAC/cycle. 128 instances of NST at 1 GHz sum to 256 GFLOPS.
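The 256 GFLOPS figure follows from the other numbers if each MAC is counted as two floating-point operations (a multiply and an add). That counting convention is an assumption on my part; the notes above do not state it. A quick check:

```python
# Back-of-the-envelope peak throughput from the figures quoted above.
# Assumption: one MAC counts as 2 floating-point operations (multiply + add).
NST_INSTANCES = 128
CLOCK_HZ = 1e9
MACS_PER_CYCLE = 1
FLOPS_PER_MAC = 2

peak_gflops = NST_INSTANCES * CLOCK_HZ * MACS_PER_CYCLE * FLOPS_PER_MAC / 1e9
print(peak_gflops)
```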

Info on software stack here. They have a gem5 based system simulator built on Cortex-A15.

Graphcore

https://www.nextplatform.com/2017/03/09/early-look-startup-graphcores-deep-learning-chip/

At the core of the IPU is a graph processing base. Underlying machine learning workloads is a computational graph, with vertices representing some compute function on a set of edges that represent data with associated weights.

The entire neural network model fits in the processor (not in the memory). "We are trying to map a graph to what is in effect a graph processor". The key is in software that takes these complex structures and maps them to a highly parallel processor with all the memory needed to hold the model inside the processor. Neural networks are expanded into a graph. The software maps the graph to a relatively simple processor with an interconnect that is controlled entirely by the compiler.

Wave Computing

https://www.nextplatform.com/2016/09/07/next-wave-deep-learning-architectures/
https://www.bdti.com/InsideDSP/2016/11/17/WaveComputing

Dataflow architecture (DPU). 16k independent processors (8 bit RISC-oriented). At the core of the approach is the use of fixed-point arithmetic with stochastic rounding techniques and many small compute elements for high parallelism.
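To make "stochastic rounding" concrete, here is a minimal sketch of the general technique. It is not Wave Computing's implementation: the function name, the `frac_bits` parameter and the 4-bit fraction are all hypothetical choices for illustration. A value is rounded up with probability equal to its distance past the lower grid point, so the rounding error averages out to zero.

```python
import math
import random

def stochastic_round(x, frac_bits=4, rng=random):
    """Round x to a fixed-point grid with `frac_bits` fractional bits.

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the result is unbiased on average.
    """
    scaled = x * (1 << frac_bits)
    lower = math.floor(scaled)
    remainder = scaled - lower          # in [0, 1)
    if rng.random() < remainder:
        lower += 1
    return lower / (1 << frac_bits)

# 0.30 lies between the grid points 0.25 and 0.3125 (steps of 1/16).
rng = random.Random(0)
samples = [stochastic_round(0.30, frac_bits=4, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), mean)
```

Round-to-nearest would always return 0.3125 here; the stochastic version returns a mix of 0.25 and 0.3125 whose mean stays close to 0.30.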

Thursday, August 3, 2017

TPU Research Notes

Here is a collection of notes from my research on TPU architecture.

In-Datacenter Performance Analysis of a Tensor Processing Unit (ISCA 2017)

link

The first TPU was developed to accelerate the inference phase of neural networks (not training). The heart of the TPU is a matrix multiply unit with 65,536 8-bit MACs and a large 28 MB software-managed on-chip memory. An activation unit implements the nonlinear activation functions in hardware (ReLU, sigmoid, and so on). One benefit of the TPU architecture is its deterministic execution model, compared with the non-deterministic (time-varying) behavior of CPUs and GPUs. The paper reports that the TPU is 15X-30X faster than a contemporary GPU or CPU, with 30X-80X better TOPS/Watt.

Introduction to NNs

Large data sets in the cloud and the availability of compute power enabled a renaissance in machine learning. Use of deep neural networks led to breakthroughs in reducing error rates in speech recognition, image recognition, etc.

NNs imitate brain-like functionality and are based on a simple artificial neuron. This "neuron" is a nonlinear function of a weighted sum of its inputs. The "neurons" are connected in layers. The "deep" part of DNN comes from going beyond a few layers, which was made possible by higher compute capability.

Two phases of NN are called training (or learning) and inference (or prediction). Today, all training is done in floating point using GPUs. A step called quantization transforms floating-point numbers into narrow integers (sometimes 8 bit) which are good enough for inference.
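As a rough illustration of the quantization step, here is a sketch of symmetric linear quantization with one shared scale factor. This particular scheme is an assumption; the notes do not say which scheme the TPU flow uses, and the function names and weights below are made up.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers."""
    return [qi * scale for qi in q]

weights = [0.82, -1.27, 0.05, 0.40]           # hypothetical trained weights
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Each restored value differs from the original by at most half a quantization step, which is the sense in which narrow integers are "good enough" for inference.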

Implementation

Designed to be a coprocessor on PCIe bus. The host sends TPU instructions for the coprocessor to execute.

TPU instructions are sent from the host over PCIe into an instruction buffer. The main instructions include:

Read_Host_Memory: Read data from CPU host memory into Unified Buffer.
Read_Weights: Reads weights from weight memory into the Weight FIFO as input to Matrix Unit
MatrixMultiply/Convolve: Perform matrix multiply or convolution from Unified Buffers into Accumulators.
Activate: Performs the nonlinear function of the artificial neuron. Its inputs are the Accumulators and its output is the Unified Buffer.
Write_Host_Memory: Write data from Unified Buffer into CPU host memory
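The five instructions above can be read as one pass through the datapath. Below is a toy functional model, assuming a single small matrix and ReLU as the activation. The buffer names follow the notes, but the class, the method signatures and the data layout are hypothetical and say nothing about the real instruction encoding.

```python
def relu(x):
    return max(0.0, x)

class ToyTPU:
    """Toy functional model of the five instructions listed above.

    Buffer names follow the notes; sizes, data layout and method
    signatures are hypothetical.
    """
    def __init__(self, weight_memory):
        self.weight_memory = weight_memory   # off-chip weights, K x N
        self.unified_buffer = None
        self.weight_fifo = None
        self.accumulators = None

    def read_host_memory(self, host_data):
        self.unified_buffer = [row[:] for row in host_data]

    def read_weights(self):
        self.weight_fifo = [row[:] for row in self.weight_memory]

    def matrix_multiply(self):
        x, w = self.unified_buffer, self.weight_fifo
        self.accumulators = [
            [sum(xr[k] * w[k][n] for k in range(len(w))) for n in range(len(w[0]))]
            for xr in x
        ]

    def activate(self, fn=relu):
        self.unified_buffer = [[fn(v) for v in row] for row in self.accumulators]

    def write_host_memory(self):
        return [row[:] for row in self.unified_buffer]

tpu = ToyTPU(weight_memory=[[1.0, -1.0], [0.5, -2.0]])
tpu.read_host_memory([[2.0, 4.0]])   # host -> Unified Buffer
tpu.read_weights()                   # weight memory -> Weight FIFO
tpu.matrix_multiply()                # Unified Buffer x weights -> Accumulators
tpu.activate()                       # Accumulators -> Unified Buffer
result = tpu.write_host_memory()     # Unified Buffer -> host
print(result)
```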

Data flows in from the left and weights are loaded from the top. Execution of matrix multiplies occurs in a systolic manner (no fetch, execute, writeback sequence).
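A small cycle-by-cycle model helps show what "systolic" means here. This is a sketch under assumptions, not the TPU's actual microarchitecture: weights are preloaded and stay put in each processing element (PE), inputs enter from the left with a one-cycle skew per row, and partial sums flow downward. The function name and the register layout are hypothetical.

```python
def systolic_matmul(x, w):
    """Cycle-by-cycle model of a weight-stationary systolic array.

    x is M x K (one input vector per row), w is K x N. PE (k, n) holds
    w[k][n]; inputs move left to right, partial sums move top to bottom.
    """
    M, K, N = len(x), len(w), len(w[0])
    a_reg = [[0] * N for _ in range(K)]   # input value latched in each PE
    p_reg = [[0] * N for _ in range(K)]   # partial sum latched in each PE
    y = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k             # row k is fed with a skew of k cycles
                    a_in = x[m][k] if 0 <= m < M else 0
                else:
                    a_in = a_reg[k][n - 1]
                p_in = p_reg[k - 1][n] if k > 0 else 0
                new_a[k][n] = a_in
                new_p[k][n] = p_in + w[k][n] * a_in
        a_reg, p_reg = new_a, new_p
        for n in range(N):                # results drain from the bottom row
            m = t - (K - 1) - n
            if 0 <= m < M:
                y[m][n] = p_reg[K - 1][n]
    return y

x = [[1, 2], [3, 4]]
w = [[5, 6], [7, 8]]
print(systolic_matmul(x, w))
```

Every PE only ever talks to its left and upper neighbors, and no instruction is fetched per multiply: the data arriving each cycle is what drives the computation.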

(Interesting fact. 15% of papers at ISCA 2016 were on hardware accelerators for NN. 9 mentioned CNN, only 2 mentioned NN).

Summary

Results TL;DR they are good.

Increasing memory bandwidth has the biggest performance impact: performance goes up by 3X when memory bandwidth increases 4X. Using GDDR5 vs. DDR3 would have improved memory bandwidth by more than a factor of 5.

Clock rate has little benefit with or without more accumulators, even though a more aggressive design may have increased the clock rate by 50%. Why? MLPs (multi-layer perceptrons) and LSTMs are memory bound; only CNNs are compute bound.

Thursday, July 27, 2017

TensorFlow Ramp Up Notes

Quick post. This was written some time ago but cleaned up this morning. Similar in essence to a previous entry on neural nets. http://blog.edmondcote.com/2017/07/my-neural-network-ramp-up-notes.html

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

http://download.tensorflow.org/paper/whitepaper2015.pdf

A TensorFlow computation is described by a directed graph that represents a dataflow computation, with extensions for maintaining/updating persistent state and for branching and looping control. Each node has 0 or more inputs and 0 or more outputs, and represents an instance of an operation. Values that flow along normal edges are called Tensors. Special edges called control dependencies can also exist. No data flows on such edges.

An operation has a name and represents an abstract computation (matrix mult. or add). An operation can have attributes that are provided at graph-construction time. A kernel is an implementation of an operation that can be run on a particular type of device.

The main components of a TensorFlow system are the client, which communicates with the master, and one or more worker processes. Each worker process is responsible for arbitrating access to 1 or more devices (GPU, CPU, etc.). The worker process executes a subgraph. Communication between nodes is achieved using send/receive primitives (RDMA, TCP).

Careful scheduling of TF operations can result in better performance of the system, specifically with respect to data transfers or memory usage (in GPU, memory is scarce).
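The graph model can be sketched without TensorFlow itself. The toy executor below is plain Python and is not the TensorFlow API: the `Node` class and `run` function are hypothetical names. It only illustrates the difference between normal edges, which carry values, and control dependencies, which only enforce ordering.

```python
class Node:
    """One operation instance in a toy dataflow graph (not the TensorFlow API)."""
    def __init__(self, name, op, inputs=(), control_deps=()):
        self.name = name
        self.op = op                             # callable taking the input values
        self.inputs = list(inputs)               # normal edges: carry values
        self.control_deps = list(control_deps)   # control edges: ordering only

def run(node, cache, trace):
    """Evaluate `node` after everything it depends on; return its value."""
    if node.name in cache:
        return cache[node.name]
    for dep in node.control_deps:                # must run first, value is ignored
        run(dep, cache, trace)
    args = [run(i, cache, trace) for i in node.inputs]
    cache[node.name] = node.op(*args)
    trace.append(node.name)
    return cache[node.name]

a = Node("a", lambda: [1.0, 2.0])
b = Node("b", lambda: [3.0, 4.0])
log = Node("log", lambda: None)                  # side-effect-only node
add = Node("add", lambda x, y: [p + q for p, q in zip(x, y)],
           inputs=[a, b], control_deps=[log])

trace = []
result = run(add, {}, trace)
print(trace, result)
```

The `log` node runs before `add` even though nothing it produces is consumed, which is exactly what a control dependency expresses.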

Pre-existing highly-optimized numerical libraries (BLAS, cuBLAS, etc.) are used to implement kernels for some operations.

Some ML algorithms, including those typically used for training neural networks, are tolerant of noise and reduced precision arithmetic.

Monday, July 24, 2017

Neural Network Ramp Up Notes

This post will list a number of resources related to neural networks and provide a summary of important points. Please consult the source articles to learn more. The notes below are merely a means for study. In fact, entire sentences are lifted from the source material.

A Quick Introduction to Neural Networks

https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/

An artificial neural network is a computational model that is inspired by biological neural networks.
The basic unit of computation in a neural network is the neuron (or node, unit). It receives input from other nodes and computes an output. Each input has an associated weight (w) which is assigned on the basis of its importance relative to the other inputs. The node applies a function (f) to the weighted sum of its inputs. The function f is non-linear and is called the activation function. There is also another input 1 with weight b (called the Bias).

Output of neuron: Y = f(w1.X1 + w2.X2 + b)

Every activation function takes a single number and performs a certain fixed mathematical operation on it. There are several activation functions that are encountered in practice: sigmoid, tanh, ReLU.

The main function of Bias is to provide every node with a trainable constant value (in addition to the normal inputs that the node receives).
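A minimal sketch of a single neuron with the three activation functions mentioned above. The input values, weights and bias are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Y = f(w1*X1 + w2*X2 + ... + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

x = [1.0, 2.0]        # hypothetical inputs X1, X2
w = [0.5, -0.25]      # hypothetical weights w1, w2

y_sigmoid = neuron(x, w, 0.0, sigmoid)
y_tanh = neuron(x, w, 0.0, math.tanh)
y_relu = neuron(x, w, 1.0, relu)     # same inputs, bias shifted to 1.0
print(y_sigmoid, y_tanh, y_relu)
```

With these numbers the weighted sum is zero, so the bias alone decides how far the ReLU output moves away from zero.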

A feedforward neural network is the simplest. The network can consist of three types of nodes: input, hidden, and output nodes. Data flows in only one direction. There are no cycles or loops. A Multi Layer Perceptron (MLP) contains one or more hidden layers. An MLP can learn the relationships between the features (x1, x2) and target y.

Example: predicting a pass/fail result from hours studied and mid term marks. This is known as a binary classification problem.

The process by which an MLP learns is called the Backpropagation (BackProp) algorithm. BackProp is like "learning from mistakes": the supervisor corrects the ANN whenever it makes a mistake.
In supervised learning, the training set is labeled. This means, for some given inputs, we know the desired/expected output (label).

For BackProp, initially all the edge weights are randomly assigned. For every input in the training dataset, the ANN is activated and its output is observed. The output is compared with the desired output that we already know, and the error is "propagated" back to the previous layer. This error is noted and the weights are "adjusted" accordingly. This process is repeated until the output error is below a predetermined threshold.

Once the above algorithm terminates, we have a "learned" ANN.
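The loop described above can be sketched in plain Python for a tiny network with 2 inputs, 2 hidden nodes and 1 output. Everything concrete here is an assumption made for illustration: the four data points, the learning rate of 0.5, the squared-error loss, the 0.01 threshold and the epoch cap are not from the source article.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: (hours studied, mid-term mark), both scaled to [0, 1],
# with label 1 = pass and 0 = fail.
DATA = [((0.9, 0.8), 1), ((0.7, 0.9), 1), ((0.2, 0.3), 0), ((0.1, 0.4), 0)]

def forward(net, x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w1"], net["b1"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"])
    return hidden, out

def loss(net):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in DATA) / (2 * len(DATA))

def train_step(net, lr):
    """One pass over the data: forward, propagate the error back, adjust."""
    for x, y in DATA:
        hidden, out = forward(net, x)
        delta_out = (out - y) * out * (1 - out)           # error at the output
        delta_hid = [delta_out * w * h * (1 - h)          # error sent back
                     for w, h in zip(net["w2"], hidden)]
        for j, h in enumerate(hidden):
            net["w2"][j] -= lr * delta_out * h
        net["b2"] -= lr * delta_out
        for j, d in enumerate(delta_hid):
            for i, xi in enumerate(x):
                net["w1"][j][i] -= lr * d * xi
            net["b1"][j] -= lr * d

rng = random.Random(0)                                    # random initial weights
net = {"w1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
       "b1": [0.0, 0.0],
       "w2": [rng.uniform(-1, 1) for _ in range(2)],
       "b2": 0.0}

initial_loss = loss(net)
epochs = 0
while loss(net) > 0.01 and epochs < 5000:                 # stop below threshold
    train_step(net, lr=0.5)
    epochs += 1
final_loss = loss(net)
print(initial_loss, final_loss, epochs)
```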

See:
https://www.quora.com/How-do-you-explain-back-propagation-algorithm-to-a-beginner-in-neural-network/answer/Hemanth-Kumar-Mantri for more information on the back propagation algorithm.

Using CNNs To Speed Up Systems

http://semiengineering.com/using-cnns-to-speed-up-systems

Convolutional neural networks (CNNs) are becoming one of the key differentiators in system performance. The most important power impact from CNNs is in ADAS (advanced driver assistance system) applications. Power is a critical factor for CNNs. The key part of a CNN is matrix multiplication. So far, it is not clear what the best platform for this is (SIMD, FPGA, DSPs, GPU). There are significant power and performance tradeoffs associated with CNNs.

At their most basic, CNNs are a combination of MAC operations and memory handling. "You need to be able to get the data in and out quickly so that you don't starve your MAC".
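To see why a CNN boils down to MACs plus memory accesses, here is a naive 2-D convolution written as explicit multiply-accumulate steps, with a counter. The image and kernel values are made up, and as in most CNN frameworks the kernel is not flipped (strictly speaking this is cross-correlation).

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) written as explicit MACs."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * (W - kw + 1) for _ in range(H - kh + 1)]
    macs = 0
    for r in range(H - kh + 1):
        for c in range(W - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]   # one MAC
                    macs += 1
            out[r][c] = acc
    return out, macs

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
kernel = [[1, 0],
          [0, 1]]
out, macs = conv2d(image, kernel)
print(out, macs)
```

Every MAC needs one image value and one kernel value, so even this 4x4 example performs 36 MACs and 72 operand reads. Keeping those reads fed is the memory-handling half of the problem.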

Is there a difference between NN and CNNs?

https://www.quora.com/Is-there-a-difference-between-neural-networks-and-convolutional-neural-networks

NN is a generic name for a large class of machine learning algorithms. Most of the algorithms are trained with back propagation.

In the late 1980s and early 1990s, the dominant algorithm in neural nets (and machine learning in general) was the fully connected neural network. These networks have a large number of parameters and do not scale well. Then came CNNs, which are not fully connected and in which neurons share weights. These types of neural nets have proven successful, especially in the fields of computer vision and natural language recognition.
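A quick parameter count shows why weight sharing scales better. The layer sizes below are hypothetical: a 28x28 single-channel image mapped to a same-sized fully connected layer, versus a convolutional layer with eight 3x3 filters.

```python
def fully_connected_params(n_in, n_out):
    return n_in * n_out + n_out            # one weight per connection + biases

def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    # The same kernel weights are reused at every image position.
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

fc = fully_connected_params(28 * 28, 28 * 28)
conv = conv_params(3, 3, 1, 8)
print(fc, conv)
```

The fully connected count grows with the square of the image size, while the convolutional count does not depend on the image size at all.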

Neural Networks and Deep Learning Book

http://neuralnetworksanddeeplearning.com/index.html

Chapter 1

Using neural nets to recognize handwritten digits. The idea is to take a large number of handwritten digits, known as training examples, and develop a system that can learn from those training examples.

A perceptron takes binary inputs and produces a single binary output. The inputs are weighted. Take the dot product of weights and inputs, then use it as input to an activation/non-linear function (this is sigmoid?). Example: Go to cheese festival? What factors: x1 is weather good, x2 does girlfriend want to go, etc. A weight applies to each decision factor.
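A sketch of the cheese festival decision. As far as I can tell, the plain perceptron in the book uses a hard threshold, with the sigmoid only appearing later for "sigmoid neurons". The weights and threshold below are illustrative values, not taken from these notes.

```python
def perceptron(inputs, weights, threshold):
    """Binary inputs in, one binary output out."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Illustrative weights: the weather matters most.
weights = [6, 2]          # x1: weather is good, x2: girlfriend wants to go
threshold = 5

go_good_weather = perceptron([1, 0], weights, threshold)
go_bad_weather = perceptron([0, 1], weights, threshold)
print(go_good_weather, go_bad_weather)
```

With these weights, good weather alone is enough to go, while company alone is not.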

For handwriting recognition problem. 28x28 pixel = 784 input neurons. Output layer is 0-9 indicating the result.

(I am omitting all the math in this chapter)

Here's how the program presented in the chapter works:

Loads MNIST training data set
Set up neural network: 784 input neurons, 30 in second layer, 10 at output
Use stochastic gradient descent to learn from MNIST training data over 30 epochs, mini-batch size of 10, and learning rate of 3.0

stochastic gradient descent is the standard learning algorithm for neural networks
mini-batch random subset of training data?
Once we exhausted all training inputs we've completed an epoch of training
Increasing learning rate to improve results
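The relationship between mini-batches and epochs can be sketched as follows. This is not the book's program: the helper name and the 100-item stand-in data set are hypothetical, and the gradient step itself is left out.

```python
import random

def mini_batches(training_data, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive slices."""
    shuffled = training_data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

rng = random.Random(0)
training_data = list(range(100))   # stand-in for (image, label) pairs

# One epoch: every training input is used exactly once, 10 at a time.
first_epoch = list(mini_batches(training_data, 10, rng))

updates = 0
for epoch in range(30):
    for batch in mini_batches(training_data, 10, rng):
        updates += 1               # one gradient step per mini-batch
print(len(first_epoch), updates)
```

So a mini-batch is indeed a random subset of the training data, and an epoch ends once the shuffled data has been consumed batch by batch.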

Deep learning means more layers.

Chapter 2

Backprop was originally introduced in the 1970s. Mathematically intensive. Backprop is the workhorse of learning in neural networks. At the heart of backprop is an expression for the partial derivative of the cost function with respect to any weight of the network. The expression tells us how quickly the cost changes when we change the weights and biases. It gives insights into how changing the weights and biases changes the overall behavior of the network.
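For a single sigmoid neuron, the partial derivative can be written out by hand and checked numerically. This is a sketch under assumptions: the quadratic cost and the particular values of w, b, x and y are chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, x, y):
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

def dcost_dw(w, b, x, y):
    a = sigmoid(w * x + b)
    return (a - y) * a * (1 - a) * x     # chain rule, what backprop computes

w, b, x, y = 0.6, 0.9, 1.0, 0.0          # hypothetical weight, bias, input, target
eps = 1e-6
numeric = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
analytic = dcost_dw(w, b, x, y)
print(numeric, analytic)
```

The two numbers agree to many decimal places: the analytic expression tells us how quickly the cost changes for a small change in the weight, without having to nudge each weight and re-run the network.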

Monday, May 8, 2017

Play and learn

In case anyone stumbles on this blog, this image describes what I'm doing ..

As does this ..

Wednesday, September 17, 2014

Wednesday Night Hack #5 - Update

Life happened. We bought a house in Silicon Valley. Work ramped up, code freeze and verification complete of next generation Adreno GPU is on the horizon (being careful to be intentionally vague here). Thoroughly enjoyed the end of summer by camping and even attended my first personal development workshop weekend.

Last week, I checked in some code to my GIT repository https://bitbucket.org/ecote. I released an updated version of my build and run flow (barf) and associated workspace generator (wsgen). An example BARF script has also been released: analyze.barf. At the moment, I am not motivated to document the operation of the code or describe any use cases. The main reason for this is because I do not know the right forum to publish the work (blog, conference paper/presentation, YouTube, etc.). In any case, these small pet projects are about having fun, no rush ...

Tuesday, June 17, 2014

Two New Technical Books

I have two new technical books on their way to my bookshelf:

I read through the table of contents of the first book and couldn't pass up its sub-$40 price tag (and 728 pages). I'm especially interested in the second book. PCB design is not a sexy field, but I've always been interested in creating something I can touch. I've researched and found TechShop in San Jose has some equipment that can help support that experiment. Likely, it's a far cry from the equipment that I could get access to at work, but alas, I want to keep my day job. I have no concrete plans for a circuit board project right now, but maybe one day ...