Read about the new NMAX technical information
At the Edge AI Summit, Cheng Wang disclosed more technical information on how the NMAX dataflow architecture works. See his complete slide presentation HERE. The new information is on slide 7 and on slides 15 onward. We’ll release more technical information on or before the April Linley Processor Spring Conference. Under NDA, we have power, performance and area estimates to share with serious strategic customers. Contact us at firstname.lastname@example.org.
NMAX™ Neural Inferencing for Edge and Data Center
NMAX neural inferencing is:
modular from 1 to >100 TOPS (worst case conditions, TSMC16FFC/12FFC)
scalable: as you double the silicon area, you double the throughput in TOPS (it is throughput that matters),
low latency: NMAX loads weights fast, so performance at batch = 1 is usually as good as large batch sizes; this is critical for edge applications,
low cost: NMAX uses the MACs with 60-90% utilization, whereas existing solutions are often <25%. This means NMAX gets more throughput out of less silicon area,
low power: NMAX uses on-chip SRAM very efficiently to generate high bandwidth so we need little DRAM. Data Center class performance is achievable with 1 LPDDR4 DRAM for ResNet-50 and 2 for YOLOv3,
able to run any kind of neural network or multiple at once,
programmed using TensorFlow or Caffe.
Cheng Wang’s overview presentation on NMAX, given October 31st at the Linley Processor Conference, can be seen HERE. (Cheng will give another presentation with additional information disclosure, at the Edge AI Summit, 11 December 2018 in San Francisco: register at www.edgeaisummit.com)
A 4-page overview of NMAX in PDF form can be viewed and downloaded HERE.
Below you will find 1) NMAX performance for ResNet-50, 2) NMAX performance for YOLOv3 real time object recognition, 3) NMAX architecture and 4) NMAX Compiler.
If you want more information than is on this page or if you don’t see the neural network you are interested in, contact us at email@example.com. We can disclose more details of silicon area and power estimates under NDA to serious customers with neural network applications.
You can read Microprocessor Report’s article on NMAX HERE.
NMAX has a unique new architecture that loads weights rapidly compared to existing solutions. Microsoft at Hot Chips 2018 showed the following slide. NMAX performance is similar to the IDEAL labeled below - it is IDEAL because you want highest throughput WITH lowest latency at lowest cost (high utilization = low cost because you need less hardware).
NMAX Performance on ResNet-50 Image Classification
ResNet-50 is a 50-stage Neural Network model that classifies 224 x 224 images. It has 22.7 Million weights: every weight must be loaded for every image classified. Each image requires 3.5 billion MACs = 7 billion operations (1 MAC = 2 operations, 1 multiply and 1 accumulate).
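The arithmetic above can be written out as a short sketch. It uses only the figures quoted in the text (3.5 billion MACs per image, 2 operations per MAC); the function names are illustrative, and the 1000 images/second rate in the usage note is a hypothetical input, not an NMAX result.

```python
# Figures quoted in the text for ResNet-50 on one 224 x 224 image.
RESNET50_MACS_PER_IMAGE = 3.5e9
OPS_PER_MAC = 2  # 1 multiply + 1 accumulate


def ops_per_image(macs_per_image: float) -> float:
    """Convert a MAC count into operations (multiply and accumulate counted separately)."""
    return macs_per_image * OPS_PER_MAC


def sustained_tops(images_per_second: float, macs_per_image: float) -> float:
    """Sustained (not peak) TOPS needed to process images at the given rate."""
    return images_per_second * ops_per_image(macs_per_image) / 1e12
```

For example, `ops_per_image(RESNET50_MACS_PER_IMAGE)` gives the 7 billion operations stated above, and `sustained_tops(1000, RESNET50_MACS_PER_IMAGE)` shows that a hypothetical 1000 images/second needs about 7 TOPS of work actually delivered, regardless of how many peak TOPS the hardware advertises.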
NMAX is modular and scalable. Below we show how NMAX (using TSMC16FFC, Slow/Slow process corner, 0.72V, 125C Tjunction) compares at the high end against existing data center inferencing solutions. NMAX 6x6/32MB means a 6x6 NMAX array (see the architecture section below) with 32MB of on-chip SRAM; the NMAX performance figures are for 16nm silicon. Tesla T4 (12nm) and Goya (process unknown) have on-chip memory too, but how much is not specified in their documentation.
The graph below uses the same information. It’s clear that Habana Goya’s performance drops ~50% from Batch = 5 or 10 to Batch = 1. NMAX’s performance drops too, but only about 5% for 6x6 and a little over 10% for 12x12. In the data center this may not be an issue. But for edge applications, where there is one sensor/camera and things must happen in real time, processing must be at batch = 1. NMAX’s performance is extremely well suited to edge applications, where cost is also critical.
Notice several important characteristics of NMAX that this table illustrates:
NMAX achieves high throughput for batch = 1. In the data center batching is an acceptable solution. On the edge there may only be 1 camera/sensor or 2 or 4. So the column that matters for the edge is batch = 1. NMAX achieves this because weights are loaded quickly so MACs don’t “stall”. Latency is very low because latency when running at batch=1 is the inverse of the images/second.
NMAX achieves high throughput by having very high MAC utilization (>85%); solutions with low MAC utilization need to have more MACs and more silicon area for the same throughput
NMAX achieves high throughput with just 1 LPDDR4 x32 DRAM; existing solutions require 8. DRAMs are expensive and burn most of the 75-100W required by the existing solutions. Wide DRAM buses also require larger PCB, lots of DDR PHY silicon area and >>500 additional BGA balls, all of which drive up cost and size. At the same performance, NMAX will use much less power.
NMAX is scalable: ResNet-50 throughput approximately doubles from 6x6 to 6x12 and again to 12x12 - so performance scales linearly with NMAX silicon area. This also works in the other direction: a 1x1 NMAX array classifies 111 images/second.
NMAX is modular: contact us at firstname.lastname@example.org to learn about the NMAX configuration that meets your throughput requirement.
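Two of the points above reduce to simple arithmetic, sketched here. The latency relation (batch = 1 latency is the inverse of images/second) and the 111 images/second figure for a 1x1 array come from the text; the MAC-count comparison additionally assumes both designs run at the same clock rate, which the text does not state, and the function names are illustrative.

```python
def latency_ms(images_per_second: float) -> float:
    """Batch = 1 latency in milliseconds: latency is the inverse of throughput."""
    if images_per_second <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / images_per_second


def relative_mac_count(utilization_a: float, utilization_b: float) -> float:
    """How many times more MACs design B needs than design A for equal throughput.

    Assumes both designs run at the same clock rate (an assumption, not a
    figure from the text), so delivered throughput is proportional to
    MAC count x utilization.
    """
    if not (0 < utilization_a <= 1 and 0 < utilization_b <= 1):
        raise ValueError("utilization must be in (0, 1]")
    return utilization_a / utilization_b
```

With the 1x1 figure of 111 images/second, `latency_ms(111)` is about 9 ms per image. Taking 85% and 25% as representative utilizations, `relative_mac_count(0.85, 0.25)` is about 3.4, meaning the lower-utilization design would need roughly 3.4 times as many MACs to deliver the same throughput under the same-clock assumption.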
NMAX Performance on YOLOv3 Real Time Object Recognition
YOLOv3 is a Neural Network model, with more than 100 stages, that does real time object recognition. It can process images of any size. It has 62 Million weights. For high resolution, 2 Megapixels (2048 x 1024), each image requires 400 billion MACs = 800 billion operations! This is more than 100x the computational requirement for ResNet-50!
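As a rough check on that comparison, the sketch below separates the increase into two factors: the larger input image and the extra work per pixel. All inputs are the figures quoted in the text; the decomposition itself is only an illustration, not an analysis published for NMAX.

```python
# Figures quoted in the text.
RESNET50_MACS = 3.5e9            # per 224 x 224 image
RESNET50_PIXELS = 224 * 224
YOLOV3_MACS = 400e9              # per 2048 x 1024 image
YOLOV3_PIXELS = 2048 * 1024

compute_ratio = YOLOV3_MACS / RESNET50_MACS        # total work per image
pixel_ratio = YOLOV3_PIXELS / RESNET50_PIXELS      # input size
per_pixel_ratio = compute_ratio / pixel_ratio      # work per input pixel

print(f"compute per image: {compute_ratio:.0f}x")
print(f"pixels per image:  {pixel_ratio:.0f}x")
print(f"MACs per pixel:    {per_pixel_ratio:.1f}x")
```

The total comes to roughly 114x. Most of it (about 42x) comes from the 2 Megapixel input being far larger than a 224 x 224 image; the remaining factor of about 2.7x is the additional work YOLOv3 does per input pixel.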
Real time object recognition is of high interest for a range of applications such as: surveillance cameras that can recognize people and differentiate between family/friends vs strangers; backup cameras that want to avoid kids and bikes; and autonomous driving at highway speeds.
Below is a table of interest to companies doing autonomous driving. The 6x6 NMAX array column runs YOLOv3 on 2 Megapixel images for 1 camera at ~30 frames/second; 12x6 for two cameras; 12x12 for four cameras. We can’t compare to others, because no one else has provided YOLOv3 performance. The conditions for NMAX in this table are: batch=1, YOLOv3, 2 Megapixel RGB images, TSMC16FFC, Slow/Slow process corner, 0.72V, 125C Tjunction.
Things to note in this table:
some of the configurations require more memory for optimal performance than ResNet-50: YOLOv3 is a bigger model,
the throughput is roughly linear: about doubles for every doubling of NMAX resources. NMAX is a scalable architecture,
MAC utilization is again very high at 50-70% for batch=1 (utilization is less than ResNet-50 because there are more weights and the images being processed are much larger),
DRAM bandwidth is very low; other architectures require hundreds of GB/second and 4x or more DRAMs. This is possible because SRAM bandwidth is huge, thanks to the distributed nature of processing and the interconnect technologies.
Again, NMAX is scalable to lower performance too: a 1x1 NMAX array processes 1 frame/second, perhaps sufficient for applications like backup or front door surveillance cameras. NMAX is modular: contact us at email@example.com to learn about the NMAX configuration that meets your throughput requirement.
If you want a different image size, performance scales with it: for example, 1 Megapixel images would process at twice the frames/second of 2 Megapixel images.
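The image-size scaling described here can be sketched as a one-line model. It assumes frames/second is inversely proportional to pixel count, which is what the 1 Megapixel versus 2 Megapixel example implies; real results may deviate at very small or very large sizes, and the function name is illustrative.

```python
def frames_per_second(reference_fps: float,
                      reference_megapixels: float,
                      target_megapixels: float) -> float:
    """Estimate frame rate at a new image size from a known reference point.

    Assumes frame rate is inversely proportional to the number of pixels,
    as in the 1 Megapixel vs 2 Megapixel example in the text.
    """
    if reference_megapixels <= 0 or target_megapixels <= 0:
        raise ValueError("image sizes must be positive")
    return reference_fps * reference_megapixels / target_megapixels
```

Starting from the ~30 frames/second quoted for a 6x6 array on 2 Megapixel images, `frames_per_second(30, 2, 1)` estimates 60 frames/second at 1 Megapixel.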
NMAX is programmed using TensorFlow/Caffe, not Verilog, to do matrix-intensive operations, especially neural networks.
NMAX uses all of the technologies Flex Logix has developed in over 4 years for EFLX eFPGA. “Under the hood” it IS an eFPGA, but optimized for neural network inferencing. Since you program it in TensorFlow or Caffe, it is like other neural inferencing accelerators (if you wanted to, you could program NMAX in Verilog like FPGAs, but no customer has asked for that yet).
XFLX interconnect programmably connects thousands of inputs to thousands of outputs with full connectivity, small area and high speed,
ArrayLinx interconnect allows “tiles” of EFLX or NMAX to create arbitrarily large arrays by automatically creating a top level mesh interconnect allowing any tile to be programmably connected with any other,
RAMLinx interconnect allows tiles to connect to SRAM blocks located between rows of tiles,
NMAX clusters of 8x8 MACs with 32 bit accumulators (like the DSP blocks in our eFPGA but optimized for NNs),
eFPGA programmable logic for control logic, management, reconfiguring data flow and functions like activation of any kind.
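To illustrate what a multiply-accumulate with a 32-bit accumulator does, here is a minimal software sketch. It reads “8x8” as 8-bit by 8-bit signed operands, and it wraps on overflow; both are assumptions for illustration, since the text does not specify signedness or overflow behavior, and the function names are hypothetical rather than part of any NMAX tool.

```python
INT8_MIN, INT8_MAX = -128, 127


def wrap_int32(value: int) -> int:
    """Wrap a Python int to signed 32-bit two's complement (assumed overflow behavior)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def mac_int8(weights, activations, accumulator: int = 0) -> int:
    """Accumulate the products of 8-bit weights and activations into a 32-bit sum."""
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    for w, a in zip(weights, activations):
        if not (INT8_MIN <= w <= INT8_MAX and INT8_MIN <= a <= INT8_MAX):
            raise ValueError("operands must fit in a signed 8-bit integer")
        # One MAC = one multiply plus one accumulate (2 operations).
        accumulator = wrap_int32(accumulator + w * a)
    return accumulator
```

Under these assumptions the largest possible product is 128 x 128 = 2^14, so a 32-bit accumulator can absorb on the order of 2^17 worst-case products before overflowing, which is the usual reason for pairing narrow multipliers with a much wider accumulator.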
NMAX is being implemented first in TSMC16FFC/12FFC so we can re-use silicon proven circuit blocks for rapid time to market.
The NMAX “building block” is the NMAX512 tile, shown on the right. It will be <2 mm² in 16nm, perhaps significantly less: exact current area estimates are available under NDA. It is called NMAX512 because it contains 512 of the 8x8 MACs, each with a 32-bit accumulator, organized into eight clusters of 64 each.
Note: this is an architectural diagram and not to scale. For example, the XFLX interconnect is not a distinct block but is spread throughout the tile. The XFLX interconnect programmably interconnects all of the blocks within a tile as needed, stage by stage.
L1 SRAM is local SRAM used for weights and activations. L1 SRAM is positioned for rapid weight loading.
L2 SRAM is connected through the EFLX IO via RAMLinx.
ArrayLinx, on each side of the tile, connects to adjacent NMAX tiles to create larger NMAX arrays by abutment.
The NMAX512 can operate as a complete neural inferencing engine.
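The tile arithmetic (eight clusters of 64 MACs, 512 MACs per tile) extends directly to arrays of tiles, as sketched below. The cluster and MAC counts come from the text; the clock frequency is a hypothetical parameter, since no clock rate is given here, so the peak TOPS function is only a formula, not an NMAX specification.

```python
CLUSTERS_PER_TILE = 8
MACS_PER_CLUSTER = 64
OPS_PER_MAC = 2  # 1 multiply + 1 accumulate

MACS_PER_TILE = CLUSTERS_PER_TILE * MACS_PER_CLUSTER  # 512, hence "NMAX512"


def array_macs(rows: int, cols: int) -> int:
    """Total MACs in a rows x cols array of NMAX512 tiles."""
    return rows * cols * MACS_PER_TILE


def peak_tops(rows: int, cols: int, clock_hz: float) -> float:
    """Peak TOPS if every MAC did useful work on every cycle.

    clock_hz is a hypothetical input: the text does not state a clock rate.
    Delivered throughput is this peak multiplied by MAC utilization.
    """
    return array_macs(rows, cols) * OPS_PER_MAC * clock_hz / 1e12
```

For instance, `array_macs(6, 6)` is 18,432 MACs and `array_macs(12, 12)` is 73,728, a 4x increase that matches the roughly linear scaling with array size described on this page.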
For higher throughput, we abut NMAX tiles and L2 SRAM, of varying sizes as needed to optimize for the target application. An example is shown here of a 2x2 NMAX array.
The tiles communicate via ArrayLinx, the blue arrows, which create an array-wide programmable mesh interconnect. Where there is L2 SRAM, ArrayLinx is routed over the top of it.
The L2 SRAM holds weights for each layer as well as activations from one layer to the next.
Our place-and-route algorithms, developed and used extensively for our EFLX eFPGA, minimize the interconnect distances between SRAM and NMAX.
NMAX’s architecture distributes processing across the array, with the bulk of memory references local to the processing. This architecture minimizes power dissipation compared to von Neumann architectures with centralized SRAM and x256-bit-bus off-chip DRAM.
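Because the L2 SRAM holds the weights, a quick sizing estimate shows how the models discussed on this page compare with an on-chip SRAM budget. The weight counts (22.7 million for ResNet-50, 62 million for YOLOv3) and the 32MB SRAM example are from the text; 8 bits per weight and 1 MB = 10^6 bytes are assumptions made for this sketch, and activation storage is ignored.

```python
RESNET50_WEIGHTS = 22.7e6
YOLOV3_WEIGHTS = 62e6


def weight_megabytes(num_weights: float, bits_per_weight: int = 8) -> float:
    """Storage for the weights alone, in MB (10^6 bytes).

    bits_per_weight defaults to 8, an assumption for this sketch;
    activations need additional space that is not counted here.
    """
    return num_weights * bits_per_weight / 8 / 1e6


def fits_in_sram(num_weights: float, sram_megabytes: float,
                 bits_per_weight: int = 8) -> bool:
    """True if all weights fit in the given SRAM budget at once."""
    return weight_megabytes(num_weights, bits_per_weight) <= sram_megabytes
```

Under these assumptions ResNet-50 needs about 22.7 MB for weights, which fits in a 32MB configuration, while YOLOv3 needs about 62 MB. That is consistent with the observation that YOLOv3 is a bigger model and that some configurations want more memory for it, although how weights are actually staged between DRAM and SRAM is not described here.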
For more details contact us at firstname.lastname@example.org to discuss your needs under NDA.
NMAX can implement any NN or run multiple concurrently.
NMAX is the ideal inferencing architecture: processing and SRAM are distributed and local. High performance is achieved with low latency, because NMAX loads weights quickly. Array sizes are totally modular to fit the needs of varying applications. L2 SRAM sizes are variable for optimizing to different NN models. Silicon area is minimized because MAC utilization is very high. DRAM bandwidth is slashed because on-chip SRAM bandwidth is huge due to our interconnect technology.
1/3 of our technical team are software developers and we are already implementing the NMAX compiler.
TensorFlow and Caffe program NMAX: they are neural network model description languages. NNs are data flow. So is NMAX. The NN “unrolls” onto the NMAX hardware, and control logic/data operators map onto EFLX eFPGA reconfigurable logic.
Power and array details and other implementation information are available under NDA now to strategic partners.
Silicon area and power specifications, and additional technical disclosures, will be made public at the Linley Spring Processor Conference in April 2019.
All of the IP deliverables for any size NMAX array will be available in mid 2019 for TSMC16FF+/FFC/12FFC. At the same time we will tape-out a neural co-processor chip which will implement an NMAX array with L2 SRAM; it will have a x64 DDR port and PCIe interface. It will be available as well on a PCIe card. It will be able to run any NN using the NMAX Compiler for purposes of evaluation, demonstration and for validation of the silicon across process/voltage/temperature.
NMAX can be ported to any process node, on demand, in 6-8 months. Contact us at email@example.com if you want to ask about your process node. Flex Logix has extensive experience porting to process nodes: Sandia 180, TSMC 40/28/16/12/7 and GF14 so far.