You do not need to know about FPGAs to integrate reconfigurable RTL into your SoC: our software maps your RTL into our EFLX array for you. But if you are curious, read on.
FPGAs are Field Programmable Gate Arrays. They offer a different kind of programmability from processors. Processors are sequential while FPGAs enable massive parallelism. A processor has one adder, one multiplier -- an FPGA can have dozens.
An embedded FPGA is equivalent to the core of an FPGA chip, which you can integrate into your chip just as you integrate processors, DSPs and RAM (all of which also used to be separate chips).
An embedded FPGA, to be useful for a wide range of chips, needs to be designed and optimized differently than an FPGA chip. An FPGA chip is optimized for one flavor of a node, implemented with maximum metal layers and full-custom logic design to minimize area. But an embedded FPGA needs to be designed differently: it must be modular to offer a wide range of sizes, it must be able to work across all variations of a process node (e.g. TSMC 16FF+/FFC/12FFC) based on a single GDS proven in silicon then re-timed for different nodes, and it must use a small number of metal layers so as to be compatible with most metal stacks. So literally using the core of an existing FPGA chip does not work well unless you can live with their metal stack and their exact process choices and limited array sizes. (For more details on this see Dense Scalable Portable Proven eFPGA GDS.)
The logic blocks in FPGAs offer Look Up Tables (LUTs) to implement any Boolean function: 4, 5, 6 inputs with one or two outputs. LUTs often feed into carry chain/shift circuitry for implementation of adders or comparators. As well, the LUTs feed into Flip-Flops which can be optionally bypassed. The basic concepts haven't changed much in decades.
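To make the LUT idea concrete, here is a minimal Python sketch (an illustration only, not a model of any actual EFLX circuit). A k-input LUT is treated as a table of 2^k stored bits indexed by its inputs. The names make_lut, lut_eval and ripple_add are hypothetical, and for brevity the example uses 3-input functions and models the carry as just another LUT, whereas a real FPGA would use dedicated carry circuitry.

```python
from itertools import product

def make_lut(func, k):
    """Return the 2**k stored bits that make a k-input LUT compute func."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

def lut_eval(config_bits, inputs):
    """Look up the output: the inputs form an index into the stored bits."""
    index = 0
    for bit in inputs:  # the first input is the most significant index bit
        index = (index << 1) | bit
    return config_bits[index]

# A full adder expressed as two 3-input functions: sum and carry-out.
SUM_LUT = make_lut(lambda a, b, cin: a ^ b ^ cin, 3)
CARRY_LUT = make_lut(lambda a, b, cin: (a & b) | (cin & (a ^ b)), 3)

def ripple_add(x, y, width):
    """Add two unsigned integers bit by bit, passing the carry along a chain."""
    carry, total = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        total |= lut_eval(SUM_LUT, (a, b, carry)) << i
        carry = lut_eval(CARRY_LUT, (a, b, carry))
    return total, carry
```

Calling ripple_add(5, 3, 4) returns the 4-bit sum and the final carry-out. Changing the stored bits changes the function without changing the hardware, which is the essence of programmability in the logic block.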
The LUTs in our first generation (Gen 1) EFLX100/TSMC 40 are dual 4-input LUTs with two bypassable Flip-Flops on the outputs. Four dual-4-input LUTs are packed in a Reconfigurable Building Block (RBB) along with carry circuitry and 8 Flip-Flops.
In Gen 2, the second generation of our architecture, the LUTs in EFLX150/4K for TSMC 16FF+/16FFC and EFLX4K for TSMC 28HPC/HPC+ are 6-input LUTs, which can also be configured as dual 5-input LUTs with two bypassable Flip-Flops on the outputs.
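One common way to picture a 6-input LUT that can also act as dual 5-input LUTs is sketched below in Python. This is an assumption-laden illustration, not a description of the EFLX implementation: it assumes the 64 stored bits are read as two 32-entry tables that share the same five inputs, and the function names are hypothetical.

```python
def input_index(inputs):
    """Turn a tuple of input bits into a table index (inputs[0] is the LSB)."""
    return sum(bit << i for i, bit in enumerate(inputs))

def lut6(config, inputs):
    """6-input mode: 64 stored bits, one output."""
    assert len(config) == 64 and len(inputs) == 6
    return config[input_index(inputs)]

def dual_lut5(config, inputs):
    """Dual 5-input mode (assumed scheme): the same 64 bits read as two
    32-entry tables driven by five shared inputs, giving two outputs."""
    assert len(config) == 64 and len(inputs) == 5
    index = input_index(inputs)
    return config[index], config[32 + index]

def table(func, k):
    """Stored bits for a k-input function, ordered to match input_index."""
    return [func(*[(n >> i) & 1 for i in range(k)]) & 1 for n in range(2 ** k)]

# Lower half: 5-input XOR (parity). Upper half: 5-input AND.
config = table(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e, 5) + \
         table(lambda a, b, c, d, e: a & b & c & d & e, 5)
```

With this layout, dual_lut5(config, (1, 0, 0, 0, 0)) yields both the parity and the AND of the same inputs, while in 6-input mode the sixth input simply selects which half of the table is read.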
For DSP applications, Multiplier-Accumulators (MACs) are useful for high performance and high density. In EFLX arrays the MAC has a 22-bit pre-adder, a 22x22 multiplier and a 48-bit post-adder/accumulator. MACs can be combined or cascaded to form fast DSP functions.
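The arithmetic of such a MAC can be sketched in a few lines of Python. Only the bit widths come from the description above. The signed two's-complement format, the wrap-on-overflow behavior, the way the pre-adder feeds the multiplier, and the names wrap_signed, mac_step and dot_product are all assumptions made for illustration.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac_step(acc, a, b, c):
    """One MAC operation: acc + (a + b) * c.

    Assumed widths: 22-bit pre-adder result, 22x22 multiply, 48-bit accumulator.
    """
    pre = wrap_signed(a + b, 22)            # pre-adder
    prod = pre * wrap_signed(c, 22)         # 22x22 product fits in 44 bits
    return wrap_signed(acc + prod, 48)      # post-adder / accumulator

def dot_product(xs, ws):
    """Accumulate x*w terms through one MAC, leaving the pre-adder's second input at 0."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = mac_step(acc, x, 0, w)
    return acc
```

For example, dot_product([1, 2, 3], [4, 5, 6]) accumulates three products in one MAC. A filter with symmetric coefficients could use the pre-adder to sum two samples before the multiply, halving the number of multiplications.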
The magic in FPGAs is the interconnect network that allows any logic block to connect to any other - this is also controlled by programming bits. Traditional FPGAs use 2D-mesh architectures that can require 10+ metal layers and take up much more area than the logic blocks themselves. Typically, in a traditional FPGA the interconnect uses 70-80% of the area of the "fabric" (the programmable part of the FPGA consisting of programmable logic and programmable interconnect). And the complexity of traditional interconnect grows as N*N.
But Flex Logix uses a new, patented interconnect, XFLX™ (the subject of the Outstanding Paper award at ISSCC 2014), which uses about half the area of the traditional interconnect and uses only 5-7 metal routing layers, but with very high utilization. The interconnect network has been further improved in our "Gen 2" Architecture, first implemented in TSMC 16nm. Here is the comparison for 28nm:
Why does it matter how many metal layers the eFPGA IP takes? The reason is that every metal layer reduces the number of metal stacks that the IP can be compatible with. In advanced process nodes there can be 20-35 different metal stacks; the lowest levels tend to be the same across all stacks due to common IP (standard cells, SRAM, etc) but then they start to diverge quickly after that because each metal stack is optimized for very different applications.
Since EFLX eFPGA IP uses the fewest metal layers, we are compatible with almost all metal stacks. eFPGA IP based on an FPGA chip likely uses maximum metal layers and is not compatible with any metal stack but the one they chose.
What is the new interconnect? It is a Boundary-Less Radix Interconnect Network, but we call it XFLX for short. At first glance, it looks like a hierarchical network that has been tried before, but it incorporates numerous refinements that improve spatial locality so as to cut area and reduce metal layers while at the same time maintaining performance. The paper presented at ISSCC is copyrighted so please refer to the 2014 ISSCC proceedings for more detail. We can reproduce the basic idea of the interconnect below. Since starting Flex Logix, we have made numerous further improvements to the interconnect scheme culminating in our Gen 2 architecture recently announced for 16nm and coming soon for 28nm (and covered by further patents recently issued). Despite being denser and using fewer metal layers, XFLX surprisingly achieves higher utilization than mesh interconnect: FPGA chips often have 60-70% utilization whereas, using XFLX, the EFLX eFPGA typically achieves 90% utilization (one recent customer design of a very large array achieved 97.7% utilization).
FPGA chips today typically offer a lot of high-performance I/O using SERDES. This is to give bandwidth sufficient to utilize the FPGA's high-performance capability. But the FPGA chip I/O can take up 25%+ of the chip area, uses a lot of power and has high latency. Embedded FPGA uses on-chip CMOS signaling which is very fast and very small, resulting in much more I/O and bandwidth. The interface pins of an EFLX eFPGA surround the array with separate inputs and outputs; each has an optional Flip-Flop. These are ordinary CMOS standard cells, so they can run very fast.
The programmable logic blocks above are combined into a single EFLX array: LUTs/RBBs (and optionally some MACs) form the center of the array, in an enveloping mesh of programmable interconnect, surrounded by a thin ring of I/O (hundreds to thousands).
All of it is programmable: the programming is done by Configuration Bits which set the values of the LUTs, the MACs and the interconnect so that the FPGA implements the exact RTL function the customer wishes. The Configuration Bits typically are stored in the same Flash Memory as the code bits for the on-chip processor.
Software is critical for an FPGA. The embedded FPGA is programmed using RTL or a netlist: Verilog or VHDL. This is mapped into the FPGA architecture using an industry standard synthesis tool, and then the EFLX Compiler, which packs, places, routes, generates timing and generates the Configuration Bit Stream to be loaded into the EFLX array to implement the RTL function. [Synopsys is a Registered Trademark of Synopsys, Inc.]
Here is a ~10 minute video demonstration of the key features of EFLX Compiler:
Customers typically want IP proven in silicon AND every customer wants a different array size. This cannot be economically achieved by designing custom embedded FPGA sizes.
Flex Logix uses a building block approach. Each EFLX embedded FPGA IP core is a stand-alone FPGA, but incorporates additional top-level interconnect which allows automatic connection to adjacent IP cores, turning them automatically into larger EFLX arrays. This strategy allows us to provide ~75 different array sizes from 100 LUTs to 200K LUTs in ~6 months from when we receive the PDK and standard cell library, provided we have a committed customer who works with us to ensure we optimize the circuit design for the right power/performance tradeoff for that market (the digital architecture remains the same).
Flex Logix implements two array sizes named for their logic capacity in LUT4 equivalents: the EFLX150 (and a DSP version where some LUTs are replaced with 2 MACs); and the EFLX4K (and a DSP version where some LUTs are replaced with 40 MACs). The EFLX4K has >1000 interface pins: 632 in and 632 out.
When we port the EFLX IP cores to a new process, we implement a validation chip with at least 2x2 arrays of the core types to validate the inter-core interconnects; we use on-chip PLL and RAM to test the blocks at full performance so we are not limited by GPIO; we use PVT monitors so we know that we are validating at precisely the worst case temperature and voltage specs.
The EFLX150 IP core can be tiled/arrayed from 1x1 to 5x5 offering about 25 array sizes up to ~3.7K LUT4s.
The EFLX4K IP core can be tiled/arrayed from 1x1 to 7x7 offering about 50 array sizes up to at least 200K LUT4s. Arrays can be square OR rectangular.
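The tiling arithmetic can be reproduced with a short Python sketch. It counts every rows x cols arrangement separately (so 2x3 and 3x2 are distinct), which is one plausible reading of "about 25" and "about 50" array sizes. The per-core capacities of 150 and 4,000 LUT4s are nominal values inferred from the core names, so the EFLX4K total in particular is only approximate. The function name array_options is hypothetical.

```python
def array_options(max_rows, max_cols, luts_per_core):
    """List every rows x cols tiling as (rows, cols, cores, nominal LUT4 capacity)."""
    return [
        (rows, cols, rows * cols, rows * cols * luts_per_core)
        for rows in range(1, max_rows + 1)
        for cols in range(1, max_cols + 1)
    ]

# Nominal per-core capacities are assumptions taken from the core names.
eflx150 = array_options(5, 5, 150)
eflx4k = array_options(7, 7, 4000)

print(len(eflx150), max(capacity for *_, capacity in eflx150))
print(len(eflx4k), max(capacity for *_, capacity in eflx4k))
```

Under these assumptions the EFLX150 core gives 25 arrangements topping out at 3,750 LUT4s, and the EFLX4K core gives 49 arrangements, the largest using 49 cores.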
For any given array size, the application may require no DSP acceleration, a lot of DSP acceleration, or some DSP acceleration. In any EFLX NxN array, the EFLX Logic and the EFLX DSP IP cores are interchangeable so you can get exactly the amount of DSP acceleration you need.
FPGAs are flexible but less area efficient than inflexible, hard-wired logic. The same is true with embedded FPGA. There is no fixed ratio of FPGA size to hard-wired size: it depends on the function being implemented and on how well the RTL is optimized for an FPGA architecture. In any case, the comparison may not be the right one: if you need flexibility, hard-wire won't provide it.
We have actually fabricated a 7x7 array, the EFLX200K validation chip in TSMC 16FFC. There are 5 rows of EFLX4K Logic cores and 2 rows of EFLX4K DSP cores. It is fully validated and copies of the validation report may be reviewed under NDA.
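As a quick worked example of mixing Logic and DSP cores, the Python sketch below tallies the cores in an array built from full rows of each type. It takes the 40 MACs per EFLX4K DSP core from the description of the core types above. The function name mix_summary and the row-based layout model are assumptions for illustration.

```python
def mix_summary(cols, logic_rows, dsp_rows, macs_per_dsp_core=40):
    """Count cores and MACs in an array made of full rows of Logic and DSP cores."""
    logic_cores = cols * logic_rows
    dsp_cores = cols * dsp_rows
    return {
        "logic_cores": logic_cores,
        "dsp_cores": dsp_cores,
        "total_cores": logic_cores + dsp_cores,
        "macs": dsp_cores * macs_per_dsp_core,
    }

# 7 columns, 5 rows of Logic cores, 2 rows of DSP cores.
summary = mix_summary(cols=7, logic_rows=5, dsp_rows=2)
print(summary)
```

For the 7x7 arrangement described above this gives 35 Logic cores and 14 DSP cores, and so 560 MACs at 40 MACs per DSP core.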
Embedded FPGAs can enable architectures not possible with FPGA chips. Look at some examples:
1. Software reconfigurable I/O pin multiplexing,
2. Flexible I/O for MCU and IoT and even SoCs,
3. Extending battery life for MCU and IoT: EFLX can do DSP at lower energy than ARM,
4. Fast control logic for Reconfigurable Cloud Data Centers,
5. DSP Acceleration,
6. Reconfigurable Accelerators.
Here is a presentation: Embedded FPGA for Architects and Physical Designers.
Here is a demonstration of the evaluation board for the EFLX200K array, a 7x7 implementation of the EFLX4K Logic and DSP cores. Click on the image for a ~5 minute video.