New EFLX4K AI eFPGA Core Optimized for Fast, Deep Learning

>10x more GigaMACs/second than any FPGA/eFPGA!

FPGA chips are used in many AI applications today, including in cloud data centers.  (See the section on neural network math below for a tutorial on AI math.)

Embedded FPGA (eFPGA) is now being used for AI applications as well.  Our first public customer doing AI with EFLX eFPGA is Harvard University, which will present a paper at Hot Chips on August 20th on edge AI processing using EFLX: "A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices."

We have other customers whose first question is "How many GigaMACs/second can you execute per square millimeter?"

The EFLX4K DSP core turns out to have as many or generally more DSP MACs per square millimeter, relative to LUTs, than other eFPGA and FPGA offerings (for example, the Xilinx VU13P has 1 DSP for every 300 LUT4s; the EFLX4K DSP has 1 DSP for every 75 LUT4s).  However, its MAC was designed for digital signal processing and is overkill for AI requirements: AI doesn't need a 22x22 multiplier, and it doesn't need pre-adders or some of the other logic in the DSP MAC.

EFLX4K AI core.png

In response to customer requests we have architected a new member of the EFLX4K family, the EFLX4K AI core, optimized for deep learning, which has >10x the GigaMACs/second per square millimeter of the EFLX4K DSP core!  The EFLX4K AI core can be implemented on any process node in 6-8 months on customer demand and can be arrayed interchangeably with the EFLX4K Logic/DSP cores.

A single EFLX4K AI core has the same number of inputs/outputs as all cores in the EFLX4K family: 632 in and 632 out, each with an optional flip-flop.

The EFLX4K AI core has 8-bit MACs (8x8 multipliers with accumulators), which can be reconfigured as 16-bit MACs, 16x8 MACs or 8x16 MACs as required.  Each core has 441 8-bit MACs, which can run at ~1GHz at worst-case conditions (125C Tj, 0.72V, slow-slow corner), for ~441 GigaMACs/second per EFLX4K AI core.  This compares to 40 MACs at ~700MHz at worst-case conditions for the EFLX4K DSP core, which is 28 GigaMACs/second.  The EFLX4K AI core has >10x the MACs/second per square millimeter!
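
The throughput figures above follow from multiplying the MAC count by the clock rate.  Here is a quick check in Python using only the numbers quoted in this paragraph; the function name is illustrative, and it assumes every MAC completes one multiply-accumulate per clock cycle:

```python
def gmacs_per_second(num_macs: int, clock_mhz: int) -> float:
    """Peak GigaMACs/second, assuming each MAC completes one
    multiply-accumulate per clock cycle (an idealized assumption)."""
    return num_macs * clock_mhz / 1000

ai_core = gmacs_per_second(441, 1000)   # EFLX4K AI: 441 MACs at ~1 GHz
dsp_core = gmacs_per_second(40, 700)    # EFLX4K DSP: 40 MACs at ~700 MHz

print(f"AI core:  {ai_core:.0f} GMACs/s")
print(f"DSP core: {dsp_core:.0f} GMACs/s")
print(f"Per-core ratio: {ai_core / dsp_core:.2f}x")
```

This is a per-core comparison of peak rates; the per-square-millimeter figure also depends on the relative size of the two cores.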

The EFLX4K AI core is the same width as the EFLX4K Logic/DSP cores and ~1.2x the height.  EFLX4K AI cores can be arrayed up to at least 7x7, and a 7x7 EFLX4K AI array delivers >20 TeraMACs/second at worst-case operating conditions.  A 4x7 array of EFLX4K AI cores has more MACs than the largest Xilinx FPGA (which is probably multiple die in one package) but fits in 28 square millimeters.  AI cores can also be mixed interchangeably, by row, with EFLX4K Logic/DSP cores, so a customer can design an EFLX array with the number of MACs and amount of control logic required for their neural network applications.
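
The array numbers can be reproduced from the per-core figures of 441 MACs at ~1GHz.  The sketch below uses hypothetical helper names, assumes one MAC operation per MAC per cycle, and assumes core area scales with height because the AI and Logic/DSP cores share the same width:

```python
MACS_PER_AI_CORE = 441   # 8-bit MACs per EFLX4K AI core (from the text)
CLOCK_MHZ = 1000         # ~1 GHz at worst-case conditions (from the text)

def array_macs(rows: int, cols: int) -> int:
    """Total 8-bit MACs in a rows x cols array of AI cores."""
    return rows * cols * MACS_PER_AI_CORE

def array_tmacs_per_second(rows: int, cols: int) -> float:
    """Peak TeraMACs/second, assuming one MAC operation per MAC per cycle."""
    return array_macs(rows, cols) * CLOCK_MHZ / 1_000_000

# Per-area gain over the DSP core (40 MACs at ~700 MHz), assuming area
# scales with height because both cores have the same width.
per_core_gain = (441 * 1000) / (40 * 700)   # AI vs. DSP peak rate per core
per_area_gain = per_core_gain / 1.2         # AI core is ~1.2x the height

print(array_macs(7, 7), "MACs,", array_tmacs_per_second(7, 7), "TMACs/s")
print(array_macs(4, 7), "MACs in a 4x7 array")
print(f"Per-area gain: ~{per_area_gain:.1f}x")
```

Under these assumptions a 7x7 array has 21,609 MACs, a 4x7 array has 12,348 MACs, and the per-area gain works out to roughly 13x, consistent with the >10x claim.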

A target spec for the EFLX4K AI core can be downloaded HERE.  This target spec is in discussion with customers and may change based on customer inputs and requirements.  

Basics of Neural Network Math Operations

Below is a very simple neural network graph.  The input layer is what the neural network will process.  For example, if the input layer were a 1024x768 picture, there would be 1024x768 = 786,432 inputs each with an R, G and B component!  The output layer is the result of the neural network: perhaps the neural network is set up to recognize a dog versus a cat versus a car versus a truck.  The hidden layers are the steps required to go from the raw input to achieve a high confidence output: typically there are many more layers than this.

neural network color.jpg

What are all the lines between the circles?  A neural network is an approximation of the neurons in a human brain, each of which receives inputs from dozens or hundreds of other neurons and then generates its own output.  In the example above, the first hidden layer has 7 "neurons": each neuron receives a "signal" or input from the 5 inputs of the input layer.

Mathematically, the hidden layer neuron value is computed as follows: [input1*weight1n + input2*weight2n + input3*weight3n + input4*weight4n + input5*weight5n] -- see the red highlighted vectors to the right --  then this value is passed through an activation function which generates the final result for the first hidden layer neuron.  
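
As a minimal sketch of that computation, the snippet below accumulates the five input-times-weight products for one neuron and then applies an activation function.  The text does not name an activation function, so ReLU is used here as an assumption, and the weight values are made up for illustration:

```python
def relu(value: float) -> float:
    """ReLU activation (an assumption; the text does not name one)."""
    return max(0.0, value)

def neuron(inputs, weights, activation):
    """Weighted sum of the inputs, followed by an activation function."""
    acc = 0.0
    for x, w in zip(inputs, weights):
        acc += x * w   # one multiply-accumulate (MAC) per input
    return activation(acc)

inputs = [1.0, 2.0, 3.0, 4.0, 5.0]        # the 5 input-layer values
weights = [0.5, -0.25, 0.25, 0.0, 0.5]    # hypothetical weights for neuron n

print(neuron(inputs, weights, relu))      # 3.25
```

Each pass through the loop is one MAC operation, which is why MAC throughput is the natural measure of neural network performance.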

Screenshot 2018-06-21 10.55.52.png

Converting all of the inputs to the first hidden layer can then be represented as a matrix multiply of the input vector times a matrix of weights.  In the matrix multiply to the right, x is the input layer vector, A is the weights matrix and the result is the value of hidden layer #1, which is then fed through the activation function.
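
To make the matrix form concrete, here is a small pure-Python sketch of one layer with 5 inputs and 7 neurons.  The layout (one row of A per hidden neuron), the ReLU activation and the weight values are all assumptions chosen for illustration:

```python
def relu(value):
    """ReLU activation (an assumption; the text does not name one)."""
    return max(0, value)

def mat_vec(A, x):
    """Multiply matrix A (one row per neuron) by input vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def layer(A, x, activation):
    """One neural network layer: matrix multiply, then activation."""
    return [activation(v) for v in mat_vec(A, x)]

x = [1, 2, 3, 4, 5]                                   # 5 input values
A = [[r - c for c in range(5)] for r in range(7)]     # 7x5 made-up weights

hidden = layer(A, x, relu)
print(hidden)                          # 7 hidden-layer values
print(len(A) * len(x), "MACs")         # 7 neurons x 5 inputs = 35 MACs
```

Even this tiny layer takes 35 multiply-accumulates; the count grows with the product of the layer sizes.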

Neural networks have two phases: training and inference.  In the training phase, the neural network is "trained" to produce the desired outputs from the inputs.  Training is typically done using GPUs and floating point math: it requires a very large database of inputs and very large processing power to arrive at a neural network that serves the desired purpose.  It is the training phase which generates the weights.  For the neural network above, there is a matrix of weights for each layer.

Once the weights are generated, the neural network can be used to classify inputs: this is called inference.  Inference is done using integer math with 16-bit values, 8-bit values or even fewer bits.
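
One common way to move from floating point weights to 8-bit integer math is symmetric linear quantization.  The text does not say which scheme is used, so the sketch below is an assumption meant only to show that an integer multiply-accumulate can closely approximate the floating point result; the function names and values are illustrative:

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with a single shared scale factor
    (symmetric linear quantization, an assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0           # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def int_dot(qa, qb):
    """Integer multiply-accumulate over two quantized vectors."""
    acc = 0
    for a, b in zip(qa, qb):
        acc += a * b
    return acc

weights = [0.5, -0.25, 0.25, 0.0, 0.5]   # hypothetical trained weights
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]

qw, w_scale = quantize(weights)
qx, x_scale = quantize(inputs)

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int_dot(qw, qx) * w_scale * x_scale   # rescale the integer result

print("float result:", exact)
print("8-bit result:", approx)
```

The integer accumulator is wider than 8 bits, which is why a MAC pairs a small multiplier with a larger accumulator.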

matrix multiply partitioning.png

For each layer, a large matrix multiplication takes place, followed by an activation function operation.  The matrix multiply dominates the math, which is why we often hear the question "How many GigaMACs/second can you do?"  The matrices of weights can be very large: millions or tens of millions of entries.  The hardware is not going to map a neural network of this size one-for-one; it would be too big.  Instead, large matrix multiplies can be done as a series of smaller block matrix multiplies, which themselves can be done as a series of row-times-column vector multiplies, as shown in the figure.
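
The partitioning idea can be sketched in a few lines of Python: the large product is built up from small block products, each accumulated into the result.  The block size and matrices here are arbitrary illustrations and say nothing about how a particular EFLX array would be partitioned:

```python
def matmul(A, B):
    """Reference matrix multiply: each entry is a row-times-column product."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def blocked_matmul(A, B, block=2):
    """Same product, computed as a series of smaller block multiplies."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):              # block row of A / C
        for j0 in range(0, m, block):          # block column of B / C
            for p0 in range(0, k, block):      # blocks along the shared dimension
                # Multiply one block of A by one block of B and
                # accumulate the partial results into C.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + block, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C

A = [[i + j for j in range(6)] for i in range(4)]        # 4x6 example
B = [[i * j - 1 for j in range(3)] for i in range(6)]    # 6x3 example

print(blocked_matmul(A, B, block=2) == matmul(A, B))     # True
```

The block size becomes a tuning parameter: it can be chosen to match however many MACs the hardware provides.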

In the EFLX4K AI, the MACs are arranged in rows, with each MAC having a direct pipeline connection to the MACs on either side.  This enables a rapid multiply/accumulate series equivalent to the row-times-column vector multiply above.  The MACs can also pipeline across cores, "jumping" from one EFLX4K AI core to the next for very long vector multiplies.
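
The following is a behavioral sketch of a pipelined MAC chain, not a description of the actual EFLX4K AI datapath.  It assumes each stage holds one column entry, adds its product to the partial sum handed over by its neighbor, and passes the result on each cycle; all names are hypothetical:

```python
def mac_chain(rows, column):
    """Simulate a chain of len(column) MAC stages, cycle by cycle.

    Stage i multiplies element i of the current row by column[i] and adds
    it to the partial sum received from stage i-1. A new row enters the
    chain every cycle, so several rows are in flight at once.
    """
    n = len(column)
    regs = [None] * n                  # per-stage register: (row index, partial sum)
    results = {}
    total_cycles = len(rows) + n - 1   # fill the pipeline, then drain it
    for cycle in range(total_cycles):
        # Update from the last stage back to the first so each stage
        # reads its neighbor's value from the previous cycle.
        for i in reversed(range(n)):
            if i == 0:
                src = (cycle, 0) if cycle < len(rows) else None
            else:
                src = regs[i - 1]
            if src is None:
                regs[i] = None
            else:
                v, partial = src
                regs[i] = (v, partial + rows[v][i] * column[i])
        if regs[n - 1] is not None:    # a finished dot product leaves the chain
            v, total = regs[n - 1]
            results[v] = total
    return [results[v] for v in range(len(rows))], total_cycles

rows = [[1, 2, 3, 4], [5, 6, 7, 8], [0, 1, 0, 1]]
column = [1, 0, 2, 1]

dots, cycles = mac_chain(rows, column)
print(dots, "in", cycles, "cycles")    # [11, 27, 1] in 6 cycles
```

In this model a longer vector simply means a longer chain, which is the role that chaining from one core to the next plays in the description above.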

The biggest value of eFPGA in AI is the ability to reconfigure: different neural networks have different configurations, and algorithms are changing rapidly, so the ability to evolve the hardware configuration is a major advantage.

This is a simple summary of the matrix math in neural networks but serves to highlight the value of dense, fast, pipelined MACs in the EFLX AI.