We Now Use LUT4 Equivalents for Capacity

We have decided to follow Xilinx's lead. Although Xilinx has used 6-input LUTs in its FPGAs for several generations, it names them by capacity in LUT4 equivalents, using a ratio of 1 LUT6 = 1.6 LUT4 (that is, a LUT6 offers about 60% more logic capacity than a LUT4).  Since Xilinx is the leader in FPGAs, many customers are used to this naming convention, so we're going with the flow.

So now our Gen 2 EFLX cores, which use 6-input-LUTs, are named EFLX4K and EFLX150, based on their capacity in LUT4-equivalents.  In our product briefs and data sheets, we'll clearly identify the number of 6-input-LUTs, as does Xilinx in theirs.

(Our original Gen 1 EFLX cores used dual-4-input LUTs, effectively a LUT5, which has about 1.2-1.3x the logic capacity of a single LUT4.)
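The naming arithmetic can be sketched in a few lines of Python. The 1.6 ratio is the one quoted above; the function name and the LUT6 count in the example are hypothetical and chosen only for illustration. The actual LUT6 counts are given in the product briefs and data sheets.

```python
LUT4_PER_LUT6 = 1.6  # ratio quoted above: 1 LUT6 = 1.6 LUT4


def lut4_equivalents(lut6_count, ratio=LUT4_PER_LUT6):
    """Return the LUT4-equivalent capacity of a core with lut6_count LUT6s."""
    return lut6_count * ratio


# Hypothetical core with 1,000 LUT6s (illustrative, not a data-sheet figure).
capacity = lut4_equivalents(1000)
print(f"{capacity:.0f} LUT4 equivalents")
```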

EFLX Gen 2 Uses LUT6 (six-input-LUTs) for Speed and Density

lut6.png

Flex Logix’ new EFLX Gen 2 architecture moves up to 6-input LUTs (LUT6).

The EFLX4K cores for TSMC16FFC/FF+ and for TSMC28HPC/HPC+ use the Gen 2 architecture, as does the EFLX150 core for TSMC16FFC/FF+.  (Only the T40ULP EFLX core uses the original Gen 1 architecture.)  All future EFLX implementations will be Gen 2.

Each LUT6 has two outputs and two flip-flops, so a LUT6 can also be used as two LUT5s, each with its own flip-flop (Out[0] feeds FF[0] and Out[1] feeds FF[1]).
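As a rough behavioral sketch of this dual-use idea, a LUT6 can be modeled as a 64-entry truth table, and the two-LUT5 mode as two 32-entry halves of that same table. This is a model for illustration only, not the EFLX circuit: it assumes the two LUT5s share the same five inputs, which the text above does not state, and all function names are hypothetical.

```python
def lut_lookup(table, inputs):
    """Read one truth-table bit; inputs[0] is the least significant address bit."""
    index = sum(bit << pos for pos, bit in enumerate(inputs))
    return (table >> index) & 1


def lut6(table64, inputs):
    """One 6-input function stored as a 64-entry truth table."""
    assert len(inputs) == 6
    return lut_lookup(table64, inputs)


def dual_lut5(table64, inputs):
    """Two 5-input functions of the same five inputs (sharing is an assumption).

    Out[0] reads the low 32 entries and Out[1] reads the high 32 entries.
    """
    assert len(inputs) == 5
    low = table64 & 0xFFFFFFFF
    high = table64 >> 32
    return lut_lookup(low, inputs), lut_lookup(high, inputs)


# Example configuration: Out[0] = 5-input AND, Out[1] = 5-input parity (XOR).
and5 = 1 << 31
parity5 = sum(1 << i for i in range(32) if bin(i).count("1") % 2)
table = (parity5 << 32) | and5

out0, out1 = dual_lut5(table, [1, 0, 0, 0, 0])
print(out0, out1)
```

In this model, driving the sixth input of `lut6` low selects the same bit as Out[0], and driving it high selects the same bit as Out[1].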

Why did we change to LUT6?

 

Xilinx and Intel/Altera FPGAs Use Big LUTs

All modern Xilinx and Intel/Altera architectures use LUTs of at least 6 inputs.  They shifted to bigger LUTs for good reasons. Xilinx, in their 7 Series FPGA CLB User Guide, September 27, 2016, says that one LUT6 = 1.6 x LUT4.

Wider LUTs Make for Faster RTL Implementation

Our customers requested wider LUTs to process wide logic cones with less delay. 

LUT4 versus LUT6.png

A LUT6 is a little bigger and so has slightly longer delay than a LUT4.  But if a LUT6 can replace two LUT4s it is MUCH faster (LUT6 delay << 2 x LUT4 delay plus interconnect delay from one LUT4 to another).  A wider LUT structure will almost always result in higher speeds than a narrower LUT structure because of fewer logic levels.  We have compared numerous designs in our Gen 1 architecture and in our Gen 2 architecture, and see much better performance in Gen 2:  typical critical path reduction is about 25%, and can often be greater.
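The logic-level argument can be illustrated with a simple model: a wide function that decomposes into a balanced tree (a wide AND or OR, for example) needs fewer levels of 6-input LUTs than of 4-input LUTs. The sketch below counts levels and adds up a path delay. The delay values are hypothetical placeholders, not EFLX timing data, and the function names are illustrative.

```python
def logic_levels(fan_in, lut_inputs):
    """Levels of k-input LUTs needed to reduce fan_in signals to one output
    with a balanced tree (for example, a wide AND or OR)."""
    levels = 0
    signals = fan_in
    while signals > 1:
        signals = -(-signals // lut_inputs)  # ceiling division
        levels += 1
    return levels


def path_delay(levels, lut_delay, wire_delay):
    """LUT delay per level plus one interconnect hop between consecutive levels."""
    return levels * lut_delay + max(levels - 1, 0) * wire_delay


# Hypothetical delays in ns, chosen only to illustrate the comparison.
LUT4_DELAY, LUT6_DELAY, WIRE_DELAY = 0.20, 0.25, 0.30

for fan_in in (6, 16, 36):
    l4 = logic_levels(fan_in, 4)
    l6 = logic_levels(fan_in, 6)
    d4 = path_delay(l4, LUT4_DELAY, WIRE_DELAY)
    d6 = path_delay(l6, LUT6_DELAY, WIRE_DELAY)
    print(f"fan-in {fan_in}: LUT4 {l4} levels ({d4:.2f} ns), "
          f"LUT6 {l6} levels ({d6:.2f} ns)")
```

With these placeholder numbers, the slightly slower LUT6 still wins whenever it removes a logic level, because each removed level also removes an interconnect hop.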

Also, a smaller LUT in a traditional FPGA mesh interconnect means more logic levels and more usage of interconnect resources, which can result in lower utilization (and so less area efficiency) and in lower speed, because connections are longer on average.  EFLX utilization is typically 90% because of our patented hierarchical interconnect.

Gao et al. reached the same conclusion in their 2005 paper, “Analysis of the Effect of LUT Size on FPGA Area and Delay Using Theoretical Derivations”: that the optimal size for performance is LUT5 or LUT6.

Wider LUTs Improve Logic Density

A commonly quoted guideline is that a LUT6 is equivalent to 1.6 LUT4 (see Peter Cheung, Department of Electrical and Electronic Engineering, Imperial College London, January 2008): this is the average over 163 benchmark designs with a few being >3 x LUT4.

Zia et al. found an even higher benefit for LUT6 over LUT4 in their 2013 paper, “Efficient Utilization of FPGA using LUT-6 Architecture.”

As a rough comparison of area, Flex Logix’ original Gen 1 EFLX2.5K core was 1.2 mm² in TSMC28HPM.  Flex Logix’ new Gen 2 EFLX4K core in TSMC28HPC+ is 1.6 mm².  Several new features were added (100x faster test mode; faster interconnect for large arrays; circuitry for DFT coverage >98%; and configuration readback circuitry).  These new features, a more robust power/ground plane design for higher speeds in HPC+, and the shift from LUT4 to LUT6 together account for the 33% increase in area.  About half of the area increase came from the new features and the more robust power design, and half from switching to LUT6.  So the area increase due ONLY to switching to LUT6 is ~1.16x.

From numerous benchmarks we examined, we see that a typical design will use 30% fewer LUTs in Gen 2 than in Gen 1 (0.7x as many).

So an RTL design implemented in Gen 2 will take on average about 1.16 x 0.7 = 0.812 of the area it takes in Gen 1, or ~20% less area.
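The arithmetic behind this estimate can be reproduced with a short script. All inputs come from the figures quoted above; the variable names are illustrative, and the even split of the area increase is the approximation stated in the text. Carrying the unrounded half-increase through gives a LUT6-only factor slightly above the ~1.16 quoted, which does not change the ~20% conclusion.

```python
gen1_area_mm2 = 1.2      # Gen 1 EFLX2.5K core in TSMC28HPM (from the text)
gen2_area_mm2 = 1.6      # Gen 2 EFLX4K core in TSMC28HPC+ (from the text)
lut_count_factor = 0.7   # typical design uses 30% fewer LUTs in Gen 2

# Total core area increase, about 33%.
total_increase = gen2_area_mm2 / gen1_area_mm2 - 1

# About half of the increase is attributed to the switch to LUT6.
lut6_area_factor = 1 + total_increase / 2

# Area of the same RTL in Gen 2 relative to Gen 1.
relative_area = lut6_area_factor * lut_count_factor

print(f"Total area increase: {total_increase:.0%}")
print(f"LUT6-only area factor: {lut6_area_factor:.2f}x")
print(f"Gen 2 area relative to Gen 1: {relative_area:.2f}")
```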

CONCLUSION

For a given RTL, Gen 2 will use about 20% less silicon area and have a critical path about 25% shorter than Gen 1 with its dual-4-input LUTs (and the advantage is even larger when compared to a single LUT4).

Comparing EFLX to competitors' density and performance brings in many other variables, such as differences in interconnect topology, but LUT6 by itself is a definite performance and density advantage over smaller LUTs.

Copyright © 2015-2017 Flex Logix Technologies, Inc. EFLX and Flex Logix are Trademarks of Flex Logix Technologies, Inc.