Ambric Technology Backgrounder
Ambric's key to a practical solution for massively parallel embedded computing was a relentless focus on getting the programming model right first. Ambric then invented new hardware architectures and circuit designs to enable that programming model.
Embedded Computing Problem
Complex high-bandwidth, real-time, streaming-media, sensor, and network applications for embedded systems can be implemented by programming a massively parallel processor array (MPPA) with software rather than by using the hardware design required for ASICs and FPGAs. But a fundamental problem is how to productively program and validate a complex, irregular application on hundreds of processors. With conventional multiprocessing techniques, keeping processors busy, communicating, and synchronized is difficult and prone to failure, and these techniques won't scale up to hundreds of CPUs.
Traditional architectures, such as CPUs and DSPs, are reaching limits in ease of development, performance, power and scalability. The most visible example is the industry's transition from frequency scaling, with its exponentially escalating problems, to scaling by adding multiple cores per die [5].
Single CPUs and DSPs can no longer keep pace with the performance imperative of Moore's Law by pursuing ever-higher clock speeds or more elaborate implementation architectures, because both approaches suffer from diminishing returns.
Conventional multi-core processors won't scale for long, especially for embedded systems. Non-determinism, thread complexity, shared-memory bottlenecks, and explicit developer-managed synchronization all place fundamental limits on this approach [4].
ASICs and high-end FPGAs are getting harder and more expensive to develop. Global timing convergence, million-dollar-plus ASIC NREs, FPGA static power issues and the RTL hardware design productivity gap between available and delivered ASIC gates [6] are among the growing difficulties.
It is better to spend transistors on supporting a more practical parallel programming model. It has been common for chip vendors to build parallel hardware architectures without adequate consideration of the programming model.
Structural Object Programming Model
Ambric's structural object programming model [1,2,3] is the foundation of how it all works:

An Ambric chip has an array of hundreds of 32-bit RISC processors that are each programmed with ordinary software, and hundreds of distributed memories. These are called objects because they obey strict encapsulation rules enforced in hardware. Objects simply run independently on their own parallel hardware, not shared or multi-threaded or virtualized. Structural objects are strictly encapsulated, execute with no side effects on one another, and have no implicitly shared memory.
Ambric objects communicate through a simple parallel structure of hardware channels. Each channel is word-wide, unidirectional, point-to-point from one object to another, and acts like a FIFO buffer. Channels carry both data and control tokens, in simple or structured messages. Channel hardware synchronizes the objects at each end, dynamically as needed at run time, not scheduled at compile time.
Inter-processor communication and synchronization are straightforward. Sending or receiving a word on a channel is as simple as reading or writing a processor register. Sending a word through a channel is both a communication and a synchronization event. This keeps objects in step with each other in a common-sense way. Since Ambric channels synchronize transparently and locally, the application achieves high performance without complex global synchronization.
Channels provide a common hardware-level interface for all objects. This makes it easy for objects to be assembled into higher-level composite objects. They work the same way as leaf objects, since objects are encapsulated and only interact through channels. Design re-use is practical and robust in this system.
The result is much easier and cheaper development, the performance and power advantages of massive parallelism, and long-term scalability.
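The channel behavior described in this section can be illustrated with a small software model. The following is only a sketch in Python, not Ambric code: the `Channel` class, the `END` control token, and the `producer` and `doubler` objects are hypothetical names introduced here, and a bounded thread-safe queue stands in for the hardware channel. It shows how a blocking send or receive acts as both a communication and a synchronization event, with no shared variables between the objects.

```python
import queue
import threading


class Channel:
    """Hypothetical model of a unidirectional, point-to-point FIFO channel."""

    def __init__(self, depth=1):
        self._fifo = queue.Queue(maxsize=depth)

    def send(self, word):
        # Blocks while the channel is full, so the sender stalls
        # until the receiver has caught up.
        self._fifo.put(word)

    def receive(self):
        # Blocks while the channel is empty, so the receiver stalls
        # until a word arrives.
        return self._fifo.get()


END = object()  # control token marking end of stream (assumption of this sketch)


def producer(out_ch, words):
    """Leaf object: emits each word, then the END control token."""
    for w in words:
        out_ch.send(w)
    out_ch.send(END)


def doubler(in_ch, out_ch):
    """Leaf object: sequential code that only touches its own channels."""
    while True:
        w = in_ch.receive()
        if w is END:
            out_ch.send(END)
            return
        out_ch.send(2 * w)


def run_pipeline(words):
    """Structure: producer -> doubler -> caller, connected by two channels."""
    a, b = Channel(), Channel()
    threads = [
        threading.Thread(target=producer, args=(a, words)),
        threading.Thread(target=doubler, args=(a, b)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        w = b.receive()
        if w is END:
            break
        results.append(w)
    for t in threads:
        t.join()
    return results
```

Calling `run_pipeline([1, 2, 3])` streams the words through both objects in order. Each object is plain sequential code; all of the parallelism lives in how the channels connect them.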
Ambric Registers and Channels
How these channels are built is the key to putting this programming model on a chip. Instead of ordinary synchronous registers, Ambric registers are used throughout the chip. A chain of Ambric registers is called an Ambric channel. Ambric channels are a fully encapsulated, fully scalable technology for passing control and data between objects. Channel stages are entirely local. There are no wires longer than one stage. Ambric channels dynamically accommodate varying delays and workloads on their own. Changing the length of a channel has no effect on functionality; only its latency changes.
Ambric processors are interconnected in hardware through Ambric channels. Each processor runs independently, responding only to its own local channels. Because the system is globally asynchronous, local changes have only local effects. This makes it possible to change the clock speed of each processor independently at runtime, to minimize power.
Ambric channels are what make possible an asynchronous system with self-synchronizing, locally buffered communication that is practical to program, fast, and scalable.
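To make the claim about channel length concrete, here is a minimal behavioral sketch in Python. It is an assumption-laden model, not Ambric's circuit: each stage is treated as a one-word register that passes its word forward only when the next stage is empty, and `simulate_channel` and `sink_ready` are hypothetical names introduced for illustration.

```python
def simulate_channel(words, stages, sink_ready=lambda cycle: True):
    """Cycle-step a chain of `stages` one-word registers.

    Returns a list of (cycle, word) pairs, one per word delivered at the
    output end. `sink_ready(cycle)` models a receiver that may stall.
    """
    regs = [None] * stages          # None means the stage is empty
    pending = list(words)           # words the sender still has to offer
    delivered = []
    cycle = 0
    while pending or any(r is not None for r in regs):
        # Output end: the last stage hands over its word if the sink accepts.
        if regs[-1] is not None and sink_ready(cycle):
            delivered.append((cycle, regs[-1]))
            regs[-1] = None
        # Interior: sweep from output toward input so that each word moves
        # at most one stage per cycle, and only into an empty neighbor.
        for i in range(stages - 1, 0, -1):
            if regs[i] is None and regs[i - 1] is not None:
                regs[i] = regs[i - 1]
                regs[i - 1] = None
        # Input end: the sender stalls while the first stage is occupied.
        if pending and regs[0] is None:
            regs[0] = pending.pop(0)
        cycle += 1
    return delivered


short = simulate_channel([10, 20, 30], stages=2)
long_ = simulate_channel([10, 20, 30], stages=5)
print([w for _, w in short] == [w for _, w in long_])  # same words, same order
print(short[0][0], long_[0][0])  # first-word latency differs with length
```

In this model every decision uses only a stage and its immediate neighbor, which mirrors the statement that channel stages are entirely local. A stalling receiver simply backs words up through the stages without changing their order.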
Processor Architecture
This programming model motivates a reassessment of the assumptions that today's processors have always been based on.
Traditional von Neumann architecture reads and writes variables in a memory space, bottlenecked through a register-memory hierarchy. Communication between processors is an afterthought. Efficient handling of streaming data such as media processing or packet processing is not inherent in conventional machines.
The objective of Ambric processor architecture is processing data and control from channels. Memories are encapsulated objects like any others, read and written using channels. Instruction streams arrive through channels. So channel communication is a first-class feature of Ambric's architecture.

This makes for very lightweight 32-bit RISC CPUs, which treat channels just like general registers. Every datapath is a self-synchronizing Ambric channel. Data and control tokens keep moving, so RAMs get used more for buffering, rather than as a static global workspace.
Every Ambric processor can execute an operation, do a loop iteration, input from channels, and output to a channel every cycle. That means a processor running a single-cycle loop delivers performance equivalent to a configurable hardware block, like those in an FPGA. But unlike FPGAs, the full range of performance versus complexity between pure hardware and ordinary software is easily available, and it is reached through software programming, not complex globally synchronous RTL-based hardware design.
Chip Architecture: Compute Unit, RAM Unit
In Ambric's chip architecture, a cluster of four processors is a Compute Unit (or CU). It has two types of processors:
The SRD processor is a 32-bit streaming RISC processor with DSP extensions for math-intensive processing. It has local memory for its 32-bit instructions, and can directly execute more code from the RAM Unit next door.
The SR processor is a simpler 32-bit streaming RISC CPU, used mainly for managing channel traffic, generating complex address streams, and other utility tasks which enable sustained high throughput for the SRDs.
The RAM Units (or RUs) are the main on-chip memories that can stream addresses and data over channels.

Chip Architecture: Brics and Interconnect
CUs and RUs are combined into the top-level physical building-block called a bric. The core of the chip is assembled just by stacking up brics. The number of brics serves as a measure of compute-capacity.
Each bric has two CU-RU pairs, totaling 8 CPUs and 21 KBytes of SRAM.
Brics connect by abutment through channels that cross bric-to-bric. The CUs and RUs are arranged so that, in the array of brics, there are contiguous CUs and contiguous RUs.
To optimize the cost and performance of each channel, the chip's configurable interconnect is hierarchical, with several levels. These channels are word-wide and run at up to 10 Gigabits per second.
These bric-long channels are the longest signals in the core, except for a low-power clock tree and the reset. So this physical architecture is very scalable going forward.
Chip Implementation
The core of Ambric's Am2045 chip has 45 brics in an array, containing a total of 336 32-bit SR and SRD processors running at 350 MHz, and 7.1 Mbits of distributed SRAM. At full speed, all processors together are capable of 1.2 trillion operations per second, over one teraOPS. This performance is supported by the interconnect's 792 gigabit per second bisection bandwidth, 26 Gbps of off-chip DDR2 memory bandwidth, PCI Express at an effective 8 Gbps each way, and up to 13 Gbps of parallel general-purpose I/O.
Programming Model and Tools
With the Ambric programming model, developing applications for this chip is straightforward: no "magic compilers," no scheduling, no synthesis.
The aDesigner integrated development environment (IDE) is based on Eclipse, an open development platform from the Eclipse Foundation (http://www.eclipse.org). It lets objects be written in a subset of standard Java or in assembly code, or be loaded from libraries; the structure is defined with graphical block diagrams or a text-based structural language. An application simulator is available in the IDE. Objects and structure are automatically compiled, placed, and routed onto the Am2045. This generates a configuration file, which is loaded into the chip at runtime. Symbolic source-level parallel debugging and performance monitoring, in real time on the chip, are available through the IDE.

The developer starts by describing the application as a parallel structure of objects, and the data and control messages they send and receive. The process is to divide and conquer the application hierarchically, defining composite objects according to higher-level functional blocks, which is a very good match to the way developers think about an application. All the parallelism is encoded in the structure, leaving the leaf objects sequential.
Since Ambric objects, even large composite objects, are strictly encapsulated with simple common channel interfaces, they are easily reused. Once validated, encapsulation protects them, so they maintain correct behavior, with no need for re-validation.
Application-specific leaf objects are written as ordinary sequential code in a subset of a standard high-level language and/or assembly code, and compiled normally.
A functional simulator is available to run and do initial debugging in a software testbench environment on the desktop.
Finally the realization tool chain auto-maps all objects onto CUs and RUs, auto-routes the channels, and creates a configuration file for the chip. At runtime the chip is configured by a host or by itself from flash, similar to FPGAs.
Runtime debugging and performance or power tuning in the real system on real data at full speed is vital. Unused processors, memories and channels are straightforward to use for real-time debug, simply by forking and copying channel traffic. The chip also includes a separate dedicated debug network. The developer can halt, step, and restart processors with the usual debug features, and can also observe or trap on channel events.
Performance Metrics
Ambric defined this programming model, architecture and tools to get very high performance from a massively parallel chip. Here are the results.
These execution examples compare the Am2045 with TI's 90nm high-end fixed-point VLIW DSP, and a large 90nm Xilinx FPGA.
| | Am2045 45-bric Chip | TI C641x DSP | Xilinx Virtex-4 LX100 - LX200 |
|---|---|---|---|
| Process | 130nm | 90nm | 90nm |
| MHz | 350 MHz | 1,000 MHz | 500 MHz Nominal |
| Published DSP Benchmarks | 10-25X throughput, 1/3 the code | 1X | n/a |
| Multiply-Accum./Sec. (16x16 to 32-bit) | 60 GMACS | 4 GMACS | 48 GMACS |
Ambric's DSP benchmark was created by implementing the same functionality as five published TI benchmark kernels. The Am2045 delivers 10 to 25 times greater throughput when extrapolated to the chip level. Its code is 1/3 the size and much less complex, without all the VLIW-style setup and teardown. Developing the benchmarks for this chip took one field application engineer a day and a half.
Am2045’s multiply-accumulate throughput is superior to 1 GHz 90nm DSPs and general-purpose logic-oriented high-end 90nm FPGAs. (Some DSP-centric FPGAs have larger nominal MAC ratings but offer only limited capacity for application logic.)
Big numbers on raw throughput and small benchmarks are great, but what about real applications?
Application Example: Motion Estimation
Consider a video encoding application, created for a customer benchmark: real-time Motion Estimation across two reference frames of broadcast-quality 720p high-definition video. Motion Estimation is the most compute-intensive part of video compression. This benchmark does a full, exhaustive search, with sums of absolute differences (SAD) between pixels as the best-match criterion. (Commercial motion estimators in silicon or software typically use far less rigorous and computationally cheaper search methods.)
The first step in the algorithm is to take in the frames to compare along with candidate motion vectors. The frames are buffered in off-chip DRAM, from which the individual 16-by-16-pixel macroblocks are read and processed.
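For readers unfamiliar with exhaustive SAD search, the core computation can be sketched in a few lines of Python. This is a generic illustration of the technique, not Ambric's implementation: the function names, the frame representation (lists of pixel rows), and the search range of plus or minus 4 pixels are assumptions made for this sketch; only the 16-by-16 macroblock size comes from the text.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n-by-n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n=16, search=4):
    """Test every displacement within +/-search pixels that keeps the
    candidate block inside the reference frame.

    Returns (best_sad, (dx, dy)), or None if no candidate fits.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue  # candidate block would fall outside the frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

The cost of the exhaustive approach is visible in the loop nest: every candidate displacement requires a full pass over the macroblock, which is why the workload suits many processors running in parallel.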
The search is done in parallel by a set of identical Motion Estimation objects that each handle a different region. Data and results are streamed down a pair of channels, one for pixels and the other for results. This is the top level of a hierarchical design.
Each of those ME units is a composite object, consisting of 4 Calculation objects which each process one macroblock at a time, and a block to collect and choose the best results. Each of those Calculation objects is another composite object assembled from leaf objects that run on the processors and memories.
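The shape of such a composite object, several workers feeding a collector, can be sketched in software. The Python below is a loose analogy only: how Ambric actually divides work among the Calculation objects is not described here, so the round-robin distribution of candidates, the `None` end-of-work token, and the names `calculation` and `motion_estimation_unit` are all assumptions of this sketch.

```python
import queue
import threading


def calculation(work_ch, result_ch, cost_fn):
    """Worker object: scores candidates from its own input channel until
    the None control token arrives, then reports its local best."""
    best = None
    while True:
        cand = work_ch.get()
        if cand is None:
            break
        scored = (cost_fn(cand), cand)
        if best is None or scored < best:
            best = scored
    result_ch.put(best)


def motion_estimation_unit(candidates, cost_fn, workers=4):
    """Composite object: `workers` calculation objects plus a collector
    that chooses the best result. Returns (cost, candidate) or None."""
    work = [queue.Queue() for _ in range(workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=calculation, args=(w, results, cost_fn))
        for w in work
    ]
    for t in threads:
        t.start()
    # Hand out candidates round-robin (an assumption of this sketch).
    for i, cand in enumerate(candidates):
        work[i % workers].put(cand)
    for w in work:
        w.put(None)
    for t in threads:
        t.join()
    # Collector: choose the lowest cost among the workers' local bests.
    partial = [results.get() for _ in range(workers)]
    partial = [p for p in partial if p is not None]
    return min(partial) if partial else None
```

From the outside, `motion_estimation_unit` looks like a single object with inputs and one result, which is the point of hierarchical composition: the caller does not need to know how many workers sit inside.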

The motion estimator for a single reference frame is shown here; the full implementation has two of these which both fit in the Am2045 chip, using 89% of the brics. They only need to run at 300 MHz. Actual sustained performance is 0.46 trillion operations per second, which is over half of the peak theoretical performance available from these brics at this clock rate.
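The "over half of peak" figure can be checked with simple arithmetic from numbers given in this document. The sketch below assumes, as a simplification, that peak throughput scales linearly with clock rate and with the fraction of brics used; that scaling rule and the variable names are assumptions, not statements from Ambric.

```python
PEAK_TERAOPS = 1.2        # whole chip at full speed (from the text)
FULL_CLOCK_MHZ = 350      # full-speed clock (from the text)
APP_CLOCK_MHZ = 300       # clock needed by the motion estimator
BRIC_FRACTION = 0.89      # share of brics the application uses
SUSTAINED_TERAOPS = 0.46  # reported sustained performance


def available_peak(peak, full_clock, app_clock, fraction):
    """Peak throughput of the brics in use at the application's clock rate,
    assuming linear scaling with both clock and bric count."""
    return peak * (app_clock / full_clock) * fraction


peak_used = available_peak(PEAK_TERAOPS, FULL_CLOCK_MHZ,
                           APP_CLOCK_MHZ, BRIC_FRACTION)
utilization = SUSTAINED_TERAOPS / peak_used
print(f"available peak: {peak_used:.2f} teraOPS")
print(f"sustained share of peak: {utilization:.1%}")
```

Under these assumptions the brics in use offer a little over 0.9 teraOPS at 300 MHz, so 0.46 teraOPS sustained is just over half of that, consistent with the statement above. Because the inputs are rounded figures, the result should be read as approximate.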
This shows how the architecture and programming model succeed in delivering very high performance on real applications in a programmable chip.
Scalability
Ambric set out to overcome fundamental barriers to scalability in development cost and difficulty, timing, and power.
Our programming model's object-based modularity makes much greater design reuse possible. Its simple combination of objects written in normal software code, combined in a hierarchy of block-diagram structures, makes high-performance design development much easier and cheaper.
Our asynchronous system of processors, memories and self-synchronizing channels, with local synchronous clocking and no long wires, is very scalable into future process generations.
Conventional techniques increase power far out of proportion to performance increase. With parallel processors, performance per watt stays relatively constant. The Ambric chip and its programming model make that massive parallelism practical.

Starting from today's 1 teraOPS 130nm Am2045, long-term scalability is available in more than one direction.
Increased performance at the same area and cost will come from process scaling and more custom implementation, leading to 65nm parts with over 1,000 processors and over 4 teraOPS.
Constant performance in a smaller area, for applications that need lower cost and energy, also becomes possible, analogous to the low-cost FPGA and DSP families.
Massively Parallel Processor Arrays
The Ambric architecture is a member of an emerging class of parallel chips called Massively Parallel Processor Arrays. MPPAs are distinguished from "multi-core" conventional processors, which have only a few processors and a shared-memory architecture, by having massive parallelism of at least hundreds of processing elements and distributed memories, and a rich word-wide flexible interconnect fabric.
Conclusions
Ambric has found practical solutions to the architectural and programming challenges that have stood in the way of massively parallel embedded computing. The result delivers extremely high performance, is silicon- and software-scalable over the long term, and requires only reasonable, software-only application development effort.
It's not enough to put lots of processors on a chip without thinking about how they're programmed. We started with our Structural Object Programming Model, which scales without limit, and designed our silicon to enable it.
Its practical access to massive parallelism realizes an order-of-magnitude increase in the throughput available from a programmable chip in a given silicon process and area.
Its modularity, low-power parallelism, and local timing mean Ambric's object-based technology can continue to track Moore's Law for many process generations to come.
References
[1] Michael Butts, A. M. Jones, Paul Wasson. A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing. Proc. IEEE Symposium on Field-Configurable Custom Computing Machines (FCCM) 2007, pp. 55-64.
[2] Michael Butts. Synchronization through Communication in a Massively Parallel Processor Array. IEEE Micro, vol. 27 no. 5, pp. 32-40, Sept./Oct. 2007.
[3] A. M. Jones, Michael Butts. TeraOPS Hardware: A New Massively-Parallel MIMD Computing Fabric IC. IEEE Hot Chips Symposium, August 2006.
[4] Edward A. Lee, "The Problem with Threads," in IEEE Computer, 39(5):33-42, May 2006.
[5] Jan M. Rabaey, "Design at the End of the Silicon Roadmap," Keynote Presentation, ASPDAC 2005, Shanghai, January 2005.
[6] Gary Smith, "The Crisis of Complexity," Dataquest briefing, 40th Design Automation Conference, 2003.