You do not need to know about FPGAs to integrate reconfigurable RTL into your SoC: our software maps your RTL into our EFLX array for you.  But if you are curious, read on.

FPGAs are Field Programmable Gate Arrays.  They offer a different kind of programmability from processors.  Processors are sequential while FPGAs enable massive parallelism.  A processor has one adder, one multiplier -- an FPGA can have dozens.  

An embedded FPGA is equivalent to the core of an FPGA chip, which you can integrate into your chip just like you integrate processors, DSPs and RAM (all of which also used to be separate chips).

An embedded FPGA, to be useful for a wide range of chips, needs to be designed and optimized differently than an FPGA chip.  An FPGA chip is optimized for one flavor of a node, implemented with maximum metal layers and full-custom logic design to minimize area.  But an embedded FPGA needs to be designed differently: it must be modular to offer a wide range of sizes, it must be able to work across all variations of a process node (e.g. TSMC 16FF+/FFC/12FFC) based on a single GDS proven in silicon then re-timed for different nodes, and it must use a small number of metal layers so as to be compatible with most metal stacks.  So literally using the core of an existing FPGA chip does not work well unless you can live with their metal stack and their exact process choices and limited array sizes. (For more details on this see Dense Scalable Portable Proven eFPGA GDS.)

The logic blocks in FPGAs offer Look Up Tables (LUTs) that can implement any Boolean function of their inputs: typically 4, 5 or 6 inputs with one or two outputs.  LUTs often feed into carry chain/shift circuitry for implementation of adders or comparators.  The LUTs also feed into Flip-Flops, which can be optionally bypassed.  The basic concepts haven't changed much in decades.
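To make the LUT idea concrete, here is a minimal Python sketch of a LUT as a truth table indexed by its inputs. The function names (`make_lut`, `eval_lut`) and the bit ordering are assumptions chosen for illustration; this is not the EFLX implementation.

```python
from itertools import product

def make_lut(func, num_inputs):
    """Build the truth table for func: one stored bit per input combination."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=num_inputs)]

def eval_lut(table, inputs):
    """Evaluate the LUT: the inputs form an address into the stored table.

    Assumption: the first input is the most significant address bit,
    which matches the order produced by itertools.product above.
    """
    address = 0
    for bit in inputs:
        address = (address << 1) | (bit & 1)
    return table[address]

# A 4-input LUT programmed as a 4-way XOR (parity) function.
xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)

# A 6-input LUT programmed as a 6-way AND; it stores 2**6 = 64 bits.
and6 = make_lut(lambda *bits: all(bits), 6)
```

A k-input LUT stores 2**k bits, so the same hardware implements any Boolean function of k inputs simply by loading a different table.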

The LUTs in EFLX100/TSMC 40 are dual 4-input LUTs with two bypassable Flip-Flops on the outputs.  Four dual-4-input LUTs are packed in a Reconfigurable Building Block (RBB) along with carry circuitry and 8 Flip-Flops.  

The LUTs in EFLX150/4K for TSMC 16FF+/16FFC and EFLX4K for TSMC 28HPC/HPC+ are 6-input LUTs, which can also be configured as dual 5-input LUTs with two bypassable Flip-Flops on the outputs.  

6-input LUTs, used in all new designs in the Gen 2 architecture, improve performance 25% and density 20% compared to the dual-4-input LUTs used in Gen 1.

For DSP applications, Multiplier-Accumulators (MACs) are useful for high performance and high density.  In EFLX arrays the MAC has a 22-bit pre-adder, a 22x22 multiplier and a 48-bit post-adder/accumulator.  MACs can be combined or cascaded to form fast DSP functions.  
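The pre-adder, multiplier and accumulator stages can be sketched as a small behavioral model. Only the bit widths (22, 22x22, 48) come from the text; the signed two's-complement arithmetic, the wrap-around on overflow, and the names `to_signed`, `Mac` and `step` are assumptions made for illustration.

```python
def to_signed(value, bits):
    """Wrap an integer into the signed two's-complement range of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value

class Mac:
    """Behavioral sketch: acc += (a + d) * b with 22/22x22/48-bit widths."""

    def __init__(self):
        self.acc = 0

    def step(self, a, d, b):
        pre = to_signed(a + d, 22)                  # 22-bit pre-adder (assumed to wrap)
        product = pre * to_signed(b, 22)            # 22x22 multiplier
        self.acc = to_signed(self.acc + product, 48)  # 48-bit accumulator (assumed to wrap)
        return self.acc

mac = Mac()
first = mac.step(3, 4, 5)     # (3 + 4) * 5
second = mac.step(-2, 0, 10)  # accumulate (-2 + 0) * 10 on top of the first result
```

The pre-adder is the stage that lets a symmetric filter add two samples sharing a coefficient before a single multiply, which is one common reason DSP blocks include it.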

[Figure: 2D mesh FPGA interconnect]

The magic in FPGAs is the interconnect network that allows any logic block to connect to any other - this is also controlled by programming bits.  Traditional FPGAs use 2D-mesh architectures that can require 10+ metal layers and take up much more area than the logic blocks themselves.  Typically, in a traditional FPGA the interconnect uses 70-80% of the area of the "fabric" (the programmable part of the FPGA consisting of programmable logic and programmable interconnect).  And the complexity of traditional interconnect grows as N*N.  

[Image: ISSCC 2014 Outstanding Paper award]

But Flex Logix uses a new, patented architecture (the subject of the Outstanding Paper award at ISSCC 2014) which uses about half the area of the traditional interconnect and uses only 5-6 metal routing layers, but with very high utilization.  The interconnect network has been further improved in our "Gen 2" Architecture, first implemented in TSMC 16nm.  Here is the comparison for 28nm:

What is the new interconnect?  It is a Boundary-Less Radix Interconnect Network.  At first glance, it looks like a hierarchical network, which has been tried before, but it incorporates numerous improvements to spatial locality that cut area while maintaining performance.  The paper presented at ISSCC is copyrighted, so please refer to the 2014 ISSCC proceedings for more detail.  We can reproduce the basic idea of the interconnect below.  Since starting Flex Logix, we have made numerous further improvements to the interconnect scheme, culminating in our Gen 2 architecture recently announced for 16nm and coming soon for 28nm (and covered by further patents recently issued).

[Figure: Boundary-Less Radix FPGA interconnect network]

FPGA chips today typically offer a lot of high-performance I/O using SERDES.  This is to give bandwidth sufficient to utilize the FPGA's high-performance capability.  But the FPGA chip I/O can take up 25%+ of the chip area and uses a lot of power, plus has high latency.  Embedded FPGA uses on-chip signaling which is very fast and very small, resulting in much more I/O and bandwidth.  The I/O in EFLX surrounds the array with separate inputs and outputs; each has an optional Flip-Flop. These are standard CMOS standard cells so they can run very fast.

The programmable logic blocks above are combined into a single EFLX array: LUTs/RBBs (and optionally some MACs) form the center of the array, in an enveloping mesh of programmable interconnect, surrounded by a thin ring of I/O (hundreds to thousands).  

All of it is programmable: the programming is done by Configuration Bits which set the values of the LUTs, the MACs and the interconnect so that the FPGA implements the exact RTL function the customer wishes.  The Configuration Bits typically are stored in the same Flash Memory as the code bits for the on-chip processor.
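As a toy illustration of how Configuration Bits can set both a LUT's contents and its interconnect, the sketch below programs one 2-input LUT whose inputs are each selected from four fabric signals by a small multiplexer. The 8-bit layout, the names `configure` and `evaluate`, and the element itself are hypothetical; the real EFLX bitstream format is not described here.

```python
def configure(bitstream):
    """Split a hypothetical 8-bit bitstream into LUT contents and two mux selects.

    Assumed layout: bits 0-3 = LUT truth table, bits 4-5 = select for input A,
    bits 6-7 = select for input B.
    """
    if len(bitstream) != 8:
        raise ValueError("toy bitstream must have exactly 8 bits")
    lut = bitstream[0:4]
    sel_a = bitstream[4] * 2 + bitstream[5]
    sel_b = bitstream[6] * 2 + bitstream[7]
    return lut, sel_a, sel_b

def evaluate(bitstream, signals):
    """Route two of the four fabric signals into the LUT and look up the output."""
    lut, sel_a, sel_b = configure(bitstream)
    a = signals[sel_a]  # programmable interconnect: pick input A
    b = signals[sel_b]  # programmable interconnect: pick input B
    return lut[a * 2 + b]

# Same hardware, two different functions, chosen only by the bits loaded:
and_of_0_and_3 = [0, 0, 0, 1,  0, 0,  1, 1]  # AND of signals 0 and 3
xor_of_1_and_2 = [0, 1, 1, 0,  0, 1,  1, 0]  # XOR of signals 1 and 2
```

Loading a different bitstream changes both which signals are connected and which function is computed, without touching the hardware.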

Software is critical for an FPGA.  The embedded FPGA is programmed using RTL or a netlist: Verilog or VHDL.  This is mapped into the FPGA architecture using an industry-standard synthesis tool and then the EFLX Compiler, which packs, places, routes, generates timing and generates the Configuration Bit Stream that is loaded into the EFLX array to implement the RTL function.  [Synopsys is a Registered Trademark of Synopsys, Inc.]

Here is a brief demo of our software for mapping your RTL to EFLX to determine LUT count and performance (timing files vary by process node):

Customers typically want IP proven in silicon AND every customer wants a different array size.  This cannot be economically achieved by designing custom embedded FPGA sizes.

Flex Logix uses a building block approach. Each EFLX embedded FPGA IP core is a stand-alone FPGA, but incorporates additional top-level interconnect which allows automatic connection to adjacent IP cores turning them automatically into larger EFLX arrays. This strategy allows us to provide ~75 different array sizes from 100 LUTs to 200K LUTs in ~6 months from when we receive PDK and standard cell library and have a committed customer who works with us to ensure we optimize the circuit design for the right power/performance tradeoff for that market (the digital architecture remains the same).

Flex Logix implements two array sizes named for their logic capacity in LUT4 equivalents: the EFLX150 (and a DSP version where some LUTs are replaced with 2 MACs); and the EFLX4K (and a DSP version where some LUTs are replaced with 40 MACs).  The EFLX150 has ~200 inputs and ~200 outputs (depending on the process) and the EFLX4K has ~600 inputs and ~600 outputs.

When we port the EFLX IP cores to a new process, we implement a validation chip with at least 2x2 arrays of the core types to validate the inter-core interconnects; we use on-chip PLL and RAM to test the blocks at full performance so we are not limited by GPIO; we use PVT monitors so we know that we are validating at precisely the worst case temperature and voltage specs.

[Figure: EFLX150 arrays from 1x1 to 5x5]

The EFLX150 IP core can be tiled/arrayed from 1x1 to 5x5 offering about 25 array sizes up to ~3.7K LUT4s.

[Figure: EFLX4K arrays from 1x1 to 7x7]

The EFLX4K IP core can be tiled/arrayed from 1x1 to 7x7 offering about 50 array sizes up to 200K LUT4s.

For any given array size, the application may require no DSP acceleration, a lot of DSP acceleration, or some DSP acceleration.  In any EFLX NxN array, the EFLX Logic and the EFLX DSP IP cores are interchangeable so you can get exactly the amount of DSP acceleration you need.
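The array-size and DSP-mix arithmetic above can be sketched in a few lines. The per-core MAC counts (2 and 40) and the tiling limits (5x5 and 7x7) come from the text; the nominal LUT4 counts of 150 and 4000 are assumed from the core names, and the `Core`, `array_sizes` and `describe` names are hypothetical. Because the text does not say how many LUTs a DSP core gives up, the LUT figure is reported only as an upper bound.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Core:
    name: str
    nominal_lut4: int        # assumption: taken from the core name
    macs_per_dsp_core: int   # from the text: 2 (EFLX150), 40 (EFLX4K)
    max_dim: int             # from the text: 5 (EFLX150), 7 (EFLX4K)

EFLX150 = Core("EFLX150", 150, 2, 5)
EFLX4K = Core("EFLX4K", 4000, 40, 7)

def array_sizes(core):
    """Enumerate every rows x cols tiling from 1x1 up to max_dim x max_dim."""
    dims = range(1, core.max_dim + 1)
    return [(rows, cols) for rows in dims for cols in dims]

def describe(core, rows, cols, dsp_cores=0):
    """Summarize one array, with dsp_cores of its tiles swapped for the DSP version."""
    total = rows * cols
    if not 0 <= dsp_cores <= total:
        raise ValueError("dsp_cores must be between 0 and rows * cols")
    return {
        "cores": total,
        "logic_cores": total - dsp_cores,
        "dsp_cores": dsp_cores,
        "macs": dsp_cores * core.macs_per_dsp_core,
        # Upper bound: DSP cores trade some LUTs for MACs (exact count not given).
        "nominal_lut4_upper_bound": total * core.nominal_lut4,
    }

largest_150 = describe(EFLX150, 5, 5)
largest_4k = describe(EFLX4K, 7, 7)
mixed_4k = describe(EFLX4K, 2, 2, dsp_cores=3)
```

Under these assumptions the enumeration gives 25 tilings for EFLX150 and 49 for EFLX4K, in line with the "about 25" and "about 50" array sizes quoted above.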

FPGAs are flexible but less area efficient than inflexible, hard-wired logic.  The same is true with embedded FPGA.  There is no fixed ratio of FPGA size to hard-wired size: it depends on the function being implemented and on how well the RTL is optimized for an FPGA architecture. In any case, the comparison may not be the right one: if you need flexibility, hard-wire won't provide it.

Embedded FPGAs can enable architectures not possible with FPGA chips.  Look at some examples:

1. Software reconfigurable I/O pin multiplexing
2. Flexible I/O for MCU and IoT and even SoCs
3. Extending battery life for MCU and IoT: EFLX can do DSP at lower energy than ARM
4. Fast control logic for Reconfigurable Cloud Data Centers
5. DSP Acceleration
6. Debugger
7. Reconfigurable Accelerators

Here is a presentation: Embedded FPGA for Architects and Physical Designers.

Here is a demonstration of the validation chip for the EFLX100 TSMC40ULP IP core in multiple array sizes and VT combinations, running at nominal voltage and 50MHz (actual performance is typically much higher):