Category Archives: DISTRIBUTED PARAMETER SYSTEMS

MULTICHIP SYSTEMS

Special care must be taken in the design of large and mul­tichip reconfigurable systems. In this section, we describe some theoretic results relevant to the major architectural and issues associated with such designs.

Interconnect Organization

Figure 8. Hardware compilation example. The C program is translated into a datapath (top) and control (bottom). Execution of statements in the while loop are controlled by s1 and s2; s0 and s3 correspond to the start signals of the statements before and after the while loop.

Figure 9. Clos network. A Clos network contains three stages: inputs, intermediate switches, and outputs. The input and output stages are symmetric. In the figure, the first-stage has r n x m switches, the second-stage has mr x r switches, and the third-stage has r m x n switches.

A classic Clos network (57) contains three stages: inputs, intermediate switches, and outputs, as shown in Fig. 9. It can be used to interconnect pins in a reconfigurable com­puting system, and its input and output stages are symmet­ric. Suppose the first stage has r nx m crossbar switches, the second stage has m r x r switches, and the third stage has rmx n switches, let us denote the network as c(n, m,r). For any two-pin net interconnect requirement, the network c(n, m, r) can achieve complete routability if m is not less than n. The routing method can be described by recursive operations (58). In the first iteration, we reduce the net­work to c(n — 1, m — 1, r). In the ith iteration, we reduce the network to c(n — i, m — i, r). When n — і = 1, we have r 1 x (m — n + 1) switches in the first stage, m — n + 1 r x r
switches in the second stage, and r(m — n + 1) x 1 switches in the third stage. In other words, only one input exists in each first-stage switch and one output in each third-stage switch. In this case, one second-stage r x r switch is enough to route the r inputs of r first-stage switches to the r outputs of r third-stage switches, thus completing the interconnect.

The reduction from c(n — i, m — i, r) to c(n — і — 1,m — і — 1,r) can be derived by a maximum matching algorithm. The matching algorithm selects disjoint signals from dif­ferent input switches to different output switches. One second-stage switch is then used to route the selected sig­nals. From Hall’s theorem, the maximum matching and

routing can always reduce the network to c(n — і — 1,m —

і — 1, r).

Conceptually, the routing problem can also be formu­lated as edge coloring on a bipartite graph G(V1, V2, E) (31). The node sets V1 and V2 represent the switches in the input and output stages, respectively. An edge in E represents a two-pin net interconnect requirement between the cor­responding input and output switches. In Reference (31), Chan and Schlag assigned colors to the edges of the bi­partite graph. Edges of the same color are bundled into one group and the corresponding set of nets are routed by one switch in the second stage. The work of Reference (59) was then used to find a minimum edge coloring solution in O(|E|logn).

The three-stage Clos network can be folded into a two – stage network (Fig. 10) so that the inputs and outputs are mixed in the first stage. Thus, the corresponding bipartite graph G(Vi, V2, E) constructed above for edge coloring is also folded with V1 and V2 merged into one set.

To find the routing assignment, the folded edge color­ing graph can be unfolded back to a bipartite graph using an Euler path search. The Euler path traverses every edge exactly once and defines the edge direction according to the direction of the traversal. We then recover the origi­nal bipartite graph by splitting the node set back into two sets V1 and V2 and unfold the edges such that all edges are directed from V1 to V2. We can find the minimum edge coloring solution of the unfolded bipartite graph and apply the solution back to the folded routing problem.

In practice, the first-level crossbar of the Clos network is replaced with FPGAs to save board space (Fig. 11). Routability is worse than an ideal Clos network. Even with a true Clos network, complete routability of multipin nets is not guaranteed, which is an important practical consid­eration because in microelectronic design, many multipin nets typically exist.

In an attempt to solve the multipin net and routability problem, we can introduce extra connections among FPIDs as shown in Fig. 12. However, extra FPID interconnections also incur extra delay. We can also expand the fanout width of FPGAs so that each FPGA I/O pin is connected to more than one FPIC (60, 61). The fanout width expansion im­proves routability without significant additional delay. The multiple appearances of I/O pins increase the probability that a signal connection can be made in a single stage, which is especially critical for multipin nets. However, the additional fanouts increase the needed pin count of FPICs. Thus, we need to find a balanced fanout distribution that reduces the interconnect delay with a minimal pin require­ment.

A tree-structured network can simplify the mapping process for certain applications. In Reference (62), an ex­ample of a tree-structured network is illustrated for a Very Large Scale Simulator (VLSS). The VLSS tree structure has all logic components located at the leaves and intercon­nect switches at the internal nodes. The machine covers a capacity of eight million gates. Each branch is an 8-bit bus. The higher up the level of the tree, the less parallelism the signal distribution can achieve. Therefore, a partitioning process is designed to minimize the high level interconnect and maximize the parallel operation.

Interconnect Multiplexing

Time multiplexing is an effective method for tackling the scalability problem in interconnecting large designs. The time-sharing method can be extended from traditional bus organization (27, 62) to network sharing (63) and further to function block sharing (40).

Interconnect can be time shared as a bus (27, 62). If n communication lines exist between two FPGAs, they can be reduced to a single line by storing logical outputs in shift registers and time-multiplexing the communication in phases. Such a scheme was employed in the virtual wires logic emulation system (27), which is efficient because in­terconnects are normally capable of being clocked at much higher rates than the critical path ofthe rest ofthe system, and all logical wires are not simultaneously active. This scheme can greatly reduce the complexity of the intercon­necting network or printed circuit board in a multi-FPGA system.

Li and Cheng (63) proposed that a dynamic network be viewed as overlapping L conventional FPICs together but sharing the same I/O pins. A dynamic routing architec­ture can increase the routability and shorten interconnect length. Each switching network is a full crossbar, which can be reconfigured to provide any connections among I/O pins. The select lines are used to activate only one switching net­work at a time; thus the I/O pins are dynamically connected according to the configuration of this active switching net­work. By dynamically reconfiguring the FPICs, L logic sig­nals can time-share the same interconnect resources.

Memory Allocation

Interconnect schemes should also consider how memory is connected to the FPGAs. Although combining memory with logic in the same FPGA is the most desirable method for reducing routing congestion and signal delay, separate components can supply much larger capacity at higher den­sity and lower price. Figure 13 demonstrates three differ­ent ways of allocating the memories in a Clos network (31, 64). The memory may be attached directly to a local FPGA (Fig. 13a), attached to the second-stage switches ofthe Clos network via a host interface (Fig. 13b), or attached to the first-stage switches ofthe Clos network (Fig. 13c). The first method provides good performance for local memory ac­cess. However, for the case of nonlocal memory access, the routability and delay are concerns. The second method is slower than the first method for local memory accesses but provides better routability. The third is the most flexible as the memory is attached to the network and the routabil – ity is high. However, every logic-to-memory communication must go through the second interconnect stage.

Bus Buffer Insertion

In FPGAs, signal propagation is inherently slow because of its programmable interconnect feature. However, the delay of long routing wires can be drastically reduced by buffer insertion. The principle at work is that by inserting buffers we can decouple capacitive effects of components and in­terconnect driven by the buffers and thereby improve RC delay.

Given a routing topology for a net and timing require­ments for its sinks, an efficient optimal buffer insertion algorithm was proposed in Reference 65. Experimental re­sults show dramatic improvement versus the unbuffered solution. Thus, it is advantageous to have abundant buffers in FPGAs. However, each possible buffer and its pro­grammable switch adds capacitance to the wires, which in turn will contribute to delay. Thus, a balance point needs to be identified to trade off between the additional delay and capacitance of the buffers versus the improvement they can provide.

Figure 10. Folded Clos network. The three-stage Clos network is folded into a two-stage network so that the inputs and outputs are mixed in the first stage.

Figure 11. Variations of the Clos network. The first level crossbar of the Clos network is replaced with FPGAs to save board space. Routability is worse than an ideal Clos network.

FPGA

FPGA

FPGA

FPGA

FPID

FPID

FPID

FPID

Figure 12. Variations of Clos network. The fanout width of FPGAs is expanded so that each FPGA I/O pin is connected to more than one FPIC. The fanout width expansion improves routability without significant additional delay.

For a multisourced bus, the problem of buffer insertion becomes more complicated, because the optimization for one source may sacrifice the delay of others. Furthermore, the direction of the buffer needs to be arbitrated by a con­troller. Instead of using such a controller, a novel approach is to use a patented open collector bus repeater (66). When idle, the two ends of the repeater are set to high. When the repeater senses the pull-down action on one side, it presents the signal on the other side until the pull-down action is released from the originated signal. The bus re­peater eliminates the need for a direction control signal, resulting in a simpler design and better use of resources.

System Decomposition

To decompose a system into multiple devices, Yeh et al. (67) proposed an algorithm based on the relationship between uniform multi-commodity flow and min-cut partitioning. Yeh et al. construct a flow network wherein each net ini­tially corresponded an edge with flow cost one. Two random modules in the network were chosen and the shortest path (i. e., path with lowest cost) between them was computed. A constant A < 1 was added to the flow for each net in the shortest path, and the cost for every net in the path was incremented. Adjusting the cost penalizes paths through congested areas and forces alternative shortest paths. This random shortest path computation is repeated until every path between the chosen pair of modules passes through at least one “saturated” net. The set of saturated nets induces a multi-way partitioning in which two modules belong to the same cluster if and only if there is a path of unsaturated nets between them.

For each of these clusters, the flux (defined as the cut – size between the cluster and its complement, divided by the size of the cluster) is computed and the clusters are sorted based on their flux value. Yeh et al. began with a single cluster equal to the entire netlist, and then peeled off the clusters with lowest flux. This approach was attrac­tive because the saturated nets are good candidates to be cut in a partitioning solution. As peeled clusters can be very small, a second phase may be used to make the multi-way partitioning more balanced. This approach, with its sub­sequent speedup by Yeh (68), is well-suited for large-scale multi-way partitioning instances.

The system prototyping phase may also explore netlist transformations such as logic replication and retiming to minimize cut size (I/O usage) or system cycle time. Such transformations are needed as inter-device delays can be relatively large and because devices are often I/O-limited. In Reference (69), Liu et al. proposed a partitioning algo­rithm that permits logic replication to minimize both cut size and clock cycle of sequential circuits. Given a netlist G = (V, E), their approach chooses two modules as seeds s and t, then constructs a “replication graph” that is twice the size of the original circuit. This graph has the spe­cial property that a type of directed minimum cut yields the replication cut (i. e., a decomposition of V into S, T, and R = V — S — T where s є S, t є T and R is the replicated logic) that is optimal. A directed version of the Fiduccia – Mattheyses algorithm is used to find a heuristic directed minimum cut in the replication graph. Cong et al. (70) present an efficient algorithm for the performance-driven multi-way circuit partitioning problem that considers the different local and global interconnect delay introduced by the partitioning.

Alpert and Kahng (71) survey the FPGA partition­ing literature in the context of major graph partitioning paradigms. The current partitioning problems are 1) low usage rate of FPGA gate capacity because I/O pin limit, 2) low clock rate because of interconnect delay between mul­tiple FPGAs and 3) long CPU time for the mapping process.

System Planning and Design Changes

For a given system decomposition to be implemented on a multi-FPGA prototyping architecture, all connections within each device and between devices must be routable. Chan et al. (72) invoke much literature on routability pre­diction in gate arrays, as well as theoretical concepts, such as the Rent parameter, to obtain a fast routability esti­mate for arbitrary netlists and FPGA architectures. Their method ascribes one of three levels of routable (easily routable, marginally routable, or unroutable) to a netlist based on various parameters. Specifically, combining a wirelength estimator due to Feuer, the average number of pins-per-cell, and the estimated Rent parameter yields a relatively accurate routability predictor. The utility of these parameters is contrasted with that of other criteria such as El Gamal’s channel width requirement (73) or the

(a)

Ґ >

X

Logic

X

X

V J

fe

X:

Ґ N

X

X

Logic

X

v )

X

outer stage

к ^

/

//V

/

/ / > \

/X|

Figure 13. Memory organization, (a) Memory is attached directly to a local FPGA. (b) Memory is attached to the second-stage switches of the Clos network via a host interface, (c) Memory is attached to the first-stage switches of the Clos network.

average pins-per-net ratio.

In addition to routability, connections must also meet system timing constraints. Selvidge et al. (74) ex­tend the original virtual wires (27) concept in their TIERS (Topology-IndEpendent Routing and Scheduling) approach. The problem formulation assumes that an as­signment from a multiple-FPGA partitioning (i. e., a design graph) to a target topology graph has already been made. The objective is to assign “links” (i. e., signal nets) to chan­nels between devices; as with the Virtual Wires concept, specific timeslices for a channel can be assigned to multi­ple links as long as no two links need to transmit signals at the same time. The TIERS algorithm uses a greedy method to order the links and then routes each link in the sched­uled order while reserving channel resources; factors of up to 2.5 improvement in system cycle time are achieved.

Chang, et al. (75) address the combined issues of routability and system timing by applying layout-driven logic resynthesis techniques. For a given wire that cannot be routed, “alternative wires” and alternative functions are identified, such that the given unroutable wire can be re­moved from the circuit and replaced with a new wire (or wires) or new logic without affecting functionality. Cheng et al. estimate that between 30% and 50% of wires have so- called “triple-wire alternatives” (i. e., replacements consist­ing of three or fewer wires). Their method first routes the wires that do not have any alternatives then replaces any unroutable wire with available alternatives. System tim­ing can be improved by replacing long wires with shorter alternatives.

CONCLUSION

Reconfigurable computing offers a middle ground between software-based systems and ASIC implementations, and is often able to combine important benefits of both. Im­plementations are able to avoid overheads such as unnec­essary data transfers, decoding and control mandatory in microprocessors, and designs can be optimized on a basis specific to an application, a problem instance or even an execution. Using this technology, it is possible to achieve size, performance, cost, or power improvements over more conventional computing technologies.

ACKNOWLEDGMENTS

The authors would like to thank Y M. Lam for his help in preparing this manuscript and Prof. Wayne Luk (Imperial College) for his proofreading of this article.

[1] K OPERATION

The relatively low Tc of Nb-Ti allows means that consid­erable gains in Hc2 and and Jc can be gained from low­ering the temperature of operation using superfluid liq­uid He, typically at 1.8 to 2 K. This has been recently exploited by the Large Hadron Collider project at CERN, which operates at 1.9 K so that the magnetic field can be pushed beyond 8 T (71). Boutboul et al. (72) have shown

[2] Mechanical breaker (air, air blast, vacuum, vacuum and magnetic field)

[3] Water cooled fuses (activated by water flow interrup­tion)

[4] Explosively actuated breaker (fuse and fuseless)

[5]3 < Cf (Tc – Tb )[1 + f (ht, K, C, p, • . •)]

[6] “Thermal” and “nonthermal ” are used inconsistently in the bio­

effects literature, at times referring to mechanism for producing an effect, and at times (even in the same paper) referring to the presence or absence of what the investigator considers to be a sig­nificant change in temperature. Some investigators use the term athermal to indicate a biological effect that occurs in the absence of noticeable heating, but which is still produced by thermal mech­anisms. An example of an athermal effect might be a physiologi­cal response caused by thermoregulation, although no noticeable change in body temperature occurs.

[8] A walk in a graph G = (V, E) is a sequence of nodes w = [v1,v2,.. .vk], k > 1, such that ( j, j + 1) є E, j = 1, 2,.. .k — 1. A walk is closed if k > 1 and v1 = vk.

DESIGN METHODS

Hardware description languages (HDLs) such as the Very High Speed Integrated Circuit Hardware Description Lan­
guage (VHDL) and Verilog are commonly used to specify the logic of a reconfigurable system. Descriptions in these languages have the advantage of being vendor neutral, so the same description can be synthesized for different tar­gets such as different FPGA devices, different FPGA ven­dors, and ASICs. For this reason, these languages are often the target language for higher level tools that offer higher levels of abstraction.

Module generators and libraries are commonly deployed to promote reuse. For example, vendors such as Altera and Xilinx have parameterized libraries of components that can be used in a design. These libraries are generated so that a circuit optimized for the particular application can be produced. As an example, a parameterized floating point library might allow the wordlength of the exponent and mantissa to be specified as well as whether denormalized numbers are supported. The module generator then gener­ates a netlist or VHDL-based floating point adder that can be included in a design.

A high level language can be directly mapped to a netlist or HDL. As an example, Luk and Page described a simple compilation process (8, 49) from a high level language with explicit parallel extensions to a register transfer language (RTL) description. Parallel execution of statements is im­plemented via parallel processes, and these can commu­nicate via channels through which a single-word message can be passed. Variables in the user program are mapped to registers, all expressions are implemented as combina­tional logic, and multiplexers are used in the case a reg­ister has multiple sources. A datapath that matches the dataflow graph of the input source description is gener­ated using this strategy. The clocking scheme employed is a global, synchronous one, and a convention that each as­signment takes exactly one clock cycle is followed. A start signal is used to feed the clock and to enable each register that corresponds to a variable, and a finish signal is gen­erated for the assignment in the following clock cycle. To execute statements sequentially, the start and finish sig­nals of adjacent statements are simply connected together, creating a one-hot distributed control scheme. Conditional statements and loops are formed by asserting one of sev­eral possible start signals that correspond to alternative basic blocks in a program. Completion ofconditional or loop constructs and synchronization of parallel blocks are im­plemented by combining relevant finish signals using the appropriate combinatorial logic. An example showing the translation of a simple code fragment to control and data­path is shown in Fig. 8.

Commercial tools that can compile standard program­ming languages such as Java, C, or C++ [e. g., (49)] are avail­able. Examples include Handel-C from Celoxica (50) and Catapult C from Mentor Graphics (51). The use of tradi­tional programming languages improves productivity as low level details are handled by the compiler. This is anal­ogous to C versus assembly language for software devel­opment. Another difference with potentially large implica­tion is that, using these tools, software developers can also design reconfigurable computing applications. Domain – specific languages such as MATLAB/Simulink (52) offer even greater improvements in productivity because they are interactive, include a large library of primitive rou­tines and toolkits, and have good graphing capabilities. Indeed, many designs for communications and signal pro­cessing are first prototyped in MATLAB and then con­verted to other languages for implementation. Tools such as the MATCH compiler (53) and Xilinx System Generator can translate a subset of MATLAB/Simulink directly to an FPGA design.

The availability of embedded operating systems such as Linux for microprocessors on an FPGA provides a fa­miliar software development environment for program­mers, greatly facilitating program development through the availability of a large range of open-source libraries as well as high quality development tools. Such tools can greatly speed up the development time and improve the quality of embedded systems. Hardware/software codesign tools such as Altera’s Nios II C-to-Hardware acceleration compiler enable time-critical functions in a C program to be converted to a hardware accelerator that is tightly coupled to a microprocessor within the FPGA (54).

Issues developing with the mapping of algorithms to hardware are more generally discussed by Isshiki and Dai (55), who focus on the differences between implementing bit-serial versus bit-parallel modules (e. g., adders and mul­tipliers) on FPGA architectures. Although latency is larger for bit-serial modules, the reduction in area frequently makes area-time products significantly lower for such im­plementations. More specifically, such advantages as the following can be obtained: 1) For bit-parallel modules, the I/O pin limitation is a major problem, and the large size of the module cluster can result in unused space and un­derutilized logic resources; 2) bit-serial modules are easier to partition as cell-to-cell connections are sparse and do not cause I/O problems; and 3) high fanout nets can impair routability of bit-parallel modules. Leong and Leong (56) generalized further with a design methodology that can translate a dataflow description with signals of different wordlengths to a digit serial design.

RUNTIME RECONFIGURATION

A reconfigurable computing system can have its functional­ity updated during execution, resulting in reduced resource requirements. A runtime reconfigurable system partitions a design temporally so that the entire design does not need to be resident in the FPGA at any given moment (38, 39). Configuration and execution can be overlapped to improve performance in the presence of reconfiguration latency. Us­ing this technique, designs that are larger than the physi­cal hardware resources can be realized in an efficient man­ner.

Dharma, a time-sharing FPGA architecture, was pro­posed that contains a functional block and an interconnect network (40). The interconnect and the logic can be time – shared. The authors proposed that emulated design topol­ogy be levelized in a folded pipeline manner; this topology simplifies the architecture and provides predictable inter­connect delay (Fig. 7).

Figure 4. Example of a logic emulation system. Arrays ofFPGAs and FPICs reside on the emulation modules. The user inputs the emulated design netlist and commands from the workstation. The workstation and control processor personalize the emulation modules, which are used in place of the emulated chip.

Figure 5. SPLASH2 architecture. Each board contains 16 FPGAs, XI through XI6. The blocks Ml through Ml6 are local memories of the FPGAs. A simplified 36-bit bus crossbar, with no permutation of the bit-lines within each bus, interconnects the 16 FPGAs. Another 36-bit bus connects the FPGAs in daisy-chain fashion. The local memories are dual ported with one port connecting to the FPGAs and the other port connecting to the external bus.

Single context, partially reconfigurable, and multiple context architectures have been proposed. In a single context system, any changes to the functionality of the FPGA involves reloading the entire bitstream; early FP – GAs were ofthis type. This scheme has the disadvantage of long reconfiguration time. Partial reconfiguration, as sup­ported by the Xilinx Virtex FPGAs (10), allows portions of the FPGA to be changed via a memory mapped scheme, whereas the other portions of the FPGA continue function­ing. Compared with a single context scheme, area overhead is associated in providing this feature. Multiple context architectures, such as NEC’s Dynamically Reconfigurable Processor (DRP) (41), allow a number of complete configu­rations to be stored in the fabric simultaneously and thus reconfiguration can be achieved in a small number of cy­
cles. This architecture has the shortest context switch time, however, a larger area overhead is associated with imple­mentation of this scheme.

The logical unit of reconfiguration could be at a num­ber of levels including the application, instruction, task, block, or sub-block level. An example of application-level reconfiguration could simply involve loading a runtime – dependent bitstream to support a particular coding stan­dard in a video coding application. The Dynamic Instruc­tion Set Computer (DISC) (42) supported demand-driven modification of the instruction set through partial reconfig­uration. The commercial Stretch processor (43) combines reconfigurable fabric with a processor to support the exe­cution of custom instructions implemented on a reconfig – urable fabric. Furthermore, the fabric can be reconfigured at runtime and the design environment is software-centric, with programming of the processor being in Stretch C.

An operating system for guarantee-based scheduling of hard real-time tasks has been proposed (44). Under control of software running on a microprocessor, task circuits can be scheduled online and placed in a suitable free space in a hardware task area. Communications between tasks and I/O are done though a task communication bus, and termi­nation of a task frees the reconfigurable resources used. It

Control FPGA

Control FPGA

FPGA +

FPGA +

4x 4GB DDR2

4x 4GB DDR2

DRAM+

DRAM+

4x MGT

4x MGT

4xIB4X Infiniband

4xIB4X Infiniband

Control FPGA

FPGA +

4x 4GB DDR2 DRAM+

4x MGT

138

Control FPGA

Control FPGA

FPGA +

FPGA +

4x 4GB DDR2

4x 4GB DDR2

DRAM+

DRAM+

4x MGT

4x MGT

4x IB4X Infiniband

4x IB4X Infiniband

Figure 6. BEE2 Compute Module block diagram. Compute modules can be interconnected via the Infiniband IB4X connectors, either directly or via a 10-Gigabit Ethernet switch. The 100-Base T Ethernet can be used for control, monitoring, or data archiving.

Figure 7. Dynamic Architecture for FPGA-based systems. The architecture contains a functional block and an interconnect network. The interconnect and the logic can be time shared. The emulated design topology is levelized in a folded pipeline manner. The levelized topology simplifies the architecture with predictable interconnect delay.

was shown that hardware in the hardware task area can be shared by tasks and the overheads associated with its implementation on a partially configurable platform were acceptably low.

A pipeline stage is often a convenient block-level unit for reconfiguration. In incremental pipeline reconfigura­tion (45), an application with S pipeline stages can be im­plemented in an FPGA with fewer than S physical pipeline stages. This is done by adding one pipeline stage and re­moving one pipeline stage in each stage of the computation. Execution and computation can be overlapped.

Runtime reconfiguration can be done at even lower lev­els. A crossbar switch which employs runtime reconfigu­ration of the FPGA’s routing resources has been described

(46) . This scheme was able to achieve density, switch up­date latency and performance higher than possible using conventional means.

Tools have been developed to support runtime reconfig­uration. For example, JBits (47) is a set of Java classes that provide an application programming interface to the Xil – inx FPGA bitstream. The interface operates on either bit­streams generated by Xilinx design tools or on bitstreams read back from actual hardware and allows the FPGA logic and routing resources to be modified.

SYSTEM ARCHITECTURES

Reconfigurable computing machines are constructed by in­terconnecting one or more FPGAs. Functionally, we can view FPGA-based systems as consisting of two compo­nents, reprogrammable FPGAs providing logic implemen­tation and field programmable interconnect chips (FPICs) providing connectivity among FPGAs. The FPICs, in turn, could be implemented as ASICs or using FPGAs. Most sys­
tems include other elements, such as microprocessors and storage, and can be treated as processing elements and memory that are interconnected. Obviously, the arrange­ment of these elements affects the system performance and routability.

The simplest topology involves FPGAs directly con­nected in a ring, mesh, or other fixed pattern. FPGAs serve as both logic and interconnect, providing direct commu­nication between adjacent devices. Such an architecture is predicated on locality in the circuit design and further assumes that the circuit design maps well to the planar mesh. This architecture fits well for applications with reg­ular local communications (30). However, in general, high performance is hard to obtain for arbitrary communication patterns because the architecture only provides direct com­munications between neighboring FPGAs and two distant FPGAs may need many other devices as “hops” to commu­nicate, resulting in long and widely variable delays. Fur­thermore, FPGAs, when used as interconnects, often result in poor timing characteristics.

A major change in the architecture of FPGA-based sys­tems was the concept of a partial crossbar interconnect, as in Realizer (26) and BORG (31). This scheme is common in logic emulation systems. Interconnection through FPICs implies that all pairs of FPGAs are neighbors, resulting in predictable interconnect delays, better timing characteris­tics, and better overall system performance (32,33). Figure 4, from Reference (26) depicts a reconfigurable computing system designed for logic emulation. Arrays of reconfig – urable processors and FPICs, both implemented using FP – GAs, reside on the emulation modules. The user inputs the emulated design netlist and commands from the worksta­tion. The workstation and control processor personalize the emulation module, which are used in place ofthe emulated chip. Thus, the target system can function properly before the actual chip is available. Furthermore, testing and de­sign change can be made by modifying software instead of reworking hardware.

Figure 5 depicts the SPLASH 2 architecture (34). Each board contains 16 FPGAs, X1 through X16. The blocks M1 through M16 are local memories of the FPGAs. A simplified 36-bit bus crossbar, with no permutation of the bit-lines within each bus, interconnects the 16 FPGAs. Another 36- bit bus connects the FPGAs in a linear systolic fashion. The local memories are dual ported with one port connecting to the FPGAs and the other port connecting to the external bus. It is interesting to note that the crossbar was added to the SPLASH 2 machine, the original SPLASH 1 machine only having the linear connections. SPLASH 2 has been successfully used for custom computing applications such as search in genetic databases and string matching (20).

Other designs have used a hierarchy of interconnect schemes, differing in performance. The use of multi-gigabit transceivers (MGT) available on contemporary FPGAs al­lows high bandwidth interconnection using commodity components. An example is the Berkeley Emulation En­gine 2 (BEE2) (22), designed for reconfigurable computing and illustrated in Fig. 6. Each compute module consists of five FPGAs (Xilinx XC2VP70) connected to four double data rate 2 (DDR2) dual inline memory modules (DIMMs) with a maximum capacity of 4GB per FPGA. Four FP-

GAs are used for computation and one for control. Each PPGA has two PowerPC 405 processor cores. A local mesh connects the computation FPGAs in a 2-D grid using low – voltage CMOS (LVCMOS) parallel signaling. Off-module communications are of via 18 (two from the control FPGA and four from each of the compute FPGAs) Infiniband 4X channel-bonded 2.5-Gbps connectors that operate full – duplex, which corresponds to a 180-Gbps off-module full – duplex communication bandwidth. Modules can be inter­connected in different topologies including tree, 3-D mesh, or crossbar. The use of standard interfaces allows standard network switches such as Infiniband and 10-Gigabit Eth­ernet to be used. Finally, a 100 base-T Ethernet connection to the control FPGA is present for out-of-band communica­tions, monitoring, and control.

Commercial machines such as the Cray XD1 (35), SRC SRC-7 (36), and Silicon Graphics RASC blade (37), have a similar interconnect structure to the BEE2 in that they are parallel machines employing high performance micro­processors tightly coupled to a relatively small number of FPGA devices per node. Nodes are interconnected via high speed switches and for specialized applications, such ma­chines can have orders ofmagnitude performance improve­ment over conventional architectures. Switching topolo­gies can be altered via configuration of the switching fabric.

APPLICATIONS

Reconfigurable computing has found widespread applica­tion in the form of “custom computing-machines” for high – energy physics (19), genome analysis (20), signal process­ing (21,22), cryptography (7,23), financial engineering (24) and other domains (25). It is unique in that the flexibility of the fabric allows customization to a degree not feasible in an ASIC. For example, in an FPGA-based implementation of RSA cryptography (23), a different hardware modular multiplier for each prime modulus was employed (i. e., the modulus was hardwired in the logic equations of the de­sign). Such an approach would not be practical in an ASIC as the design effort and cost is too high to develop a differ­ent chip for different moduli. This led to greatly reduced hardware and improved performance, the implementation being an order of magnitude faster than any reported im­plementation in any technology at the time.

Another important application is logic emulation (26, 27) where reconfigurable computing is used not only for simulation acceleration, but also for prototyping of ASICs and in-circuit emulation. In-circuit emulation allows the possibility of testing prototypes at full or near-full speed, allowing more thorough testing of time-dependent applica­tions such as networks. It also removes many of the depen­dencies between ASIC and firmware development, allow­ing them to proceed in parallel and hence shortening devel­opment time. As an example, it was used in Reference (28) for the development of a two-million-gate ASIC containing an IEEE 802.11 medium access controller and IEEE 802.1 la/b/g physical layer processor. Using a reconfigurable pro­totype of the ASIC on a commodity FPGA board, the ASIC went through one complete pass of real-time beta testing before tape-out.

Digital logic, of course, maps extremely well to fine­grained FPGA devices. The main design issues for such systems lie in partitioning of a design among multiple FPGAs and dealing with the interconnect bottleneck be­tween chips. The Cadence Palladium II emulator (29) is a commercial example of a logic emulation system and has 256-million-gate logic capacity and 74-GB memory capac­ity. It uses custom ASICs optimized for logic emulation and is 100-10,000 times faster than software-based reg­ister transfer language simulation. Further discussion of interconnect time-multiplexing and system decomposition is given later in this article.

Hoang (20) implemented algorithms to find minimum edit distances for protein and DNA sequences on the Splash 2 architecture. Splash 2 can be modeled in terms of both bidirectional and unidirectional systolic arrays. In the bidirectional algorithm, the source character stream is fed to the leftmost processing element (PE), whereas the target stream is fed to the rightmost PE. Compar­ing two sequences of length m and n requires at least 2 x max(m + 1, n + 1) processors, and the number of steps required to compute the edit distance is proportional to the size of the array. The unidirectional algorithm is suited for comparing a single source sequence against multiple tar­get sequences. The source sequence is first loaded as in the bidirectional case, and the target sequences are fed in one after the other and processed as they pass through the PEs (which results in virtually 100% utilization of processors, so that the unidirectional model is better suited for large database searches).

The BEE2 system (22), described in the next section, was applied to the radio astronomy signal processing domain, which included development of a billion-channel spectrom­eter, a 1024-channel polyphase filter banks, and a two – input, 1024-channel correlator. The FPGA-based system used a 130-nm technology FPGA and performance was compared with 130-and 90-nm DSP chips as well as a 90­nm microprocessor. Performance in terms of computational throughput per chip was found to be a factor of 10 to 34 over the DSP chip in 130-nm technology and 4 to 13 times bet­ter than the microprocessor. In terms of power efficiency, the FPGA was one order of magnitude better than the DSP and two orders of magnitude better than the microproces­sor. Compute throughput per unit chip cost was 20-307% better than the 90-nm DSP and 50-500% better than the microprocessor.