MULTICHIP SYSTEMS
Special care must be taken in the design of large and multichip reconfigurable systems. In this section, we describe some theoretic results relevant to the major architectural and issues associated with such designs.
Figure 8. Hardware compilation example. The C program is translated into a datapath (top) and control (bottom). Execution of statements in the while loop are controlled by s1 and s2; s0 and s3 correspond to the start signals of the statements before and after the while loop. 
Figure 9. Clos network. A Clos network contains three stages: inputs, intermediate switches, and outputs. The input and output stages are symmetric. In the figure, the firststage has r n x m switches, the secondstage has mr x r switches, and the thirdstage has r m x n switches. 
A classic Clos network (57) contains three stages: inputs, intermediate switches, and outputs, as shown in Fig. 9. It can be used to interconnect pins in a reconfigurable computing system, and its input and output stages are symmetric. Suppose the first stage has r nx m crossbar switches, the second stage has m r x r switches, and the third stage has rmx n switches, let us denote the network as c(n, m,r). For any twopin net interconnect requirement, the network c(n, m, r) can achieve complete routability if m is not less than n. The routing method can be described by recursive operations (58). In the first iteration, we reduce the network to c(n — 1, m — 1, r). In the ith iteration, we reduce the network to c(n — i, m — i, r). When n — і = 1, we have r 1 x (m — n + 1) switches in the first stage, m — n + 1 r x r
switches in the second stage, and r(m — n + 1) x 1 switches in the third stage. In other words, only one input exists in each firststage switch and one output in each thirdstage switch. In this case, one secondstage r x r switch is enough to route the r inputs of r firststage switches to the r outputs of r thirdstage switches, thus completing the interconnect.
The reduction from c(n — i, m — i, r) to c(n — і — 1,m — і — 1,r) can be derived by a maximum matching algorithm. The matching algorithm selects disjoint signals from different input switches to different output switches. One secondstage switch is then used to route the selected signals. From Hall’s theorem, the maximum matching and
routing can always reduce the network to c(n — і — 1,m —
і — 1, r).
Conceptually, the routing problem can also be formulated as edge coloring on a bipartite graph G(V1, V2, E) (31). The node sets V1 and V2 represent the switches in the input and output stages, respectively. An edge in E represents a twopin net interconnect requirement between the corresponding input and output switches. In Reference (31), Chan and Schlag assigned colors to the edges of the bipartite graph. Edges of the same color are bundled into one group and the corresponding set of nets are routed by one switch in the second stage. The work of Reference (59) was then used to find a minimum edge coloring solution in O(Elogn).
The threestage Clos network can be folded into a two — stage network (Fig. 10) so that the inputs and outputs are mixed in the first stage. Thus, the corresponding bipartite graph G(Vi, V2, E) constructed above for edge coloring is also folded with V1 and V2 merged into one set.
To find the routing assignment, the folded edge coloring graph can be unfolded back to a bipartite graph using an Euler path search. The Euler path traverses every edge exactly once and defines the edge direction according to the direction of the traversal. We then recover the original bipartite graph by splitting the node set back into two sets V1 and V2 and unfold the edges such that all edges are directed from V1 to V2. We can find the minimum edge coloring solution of the unfolded bipartite graph and apply the solution back to the folded routing problem.
In practice, the firstlevel crossbar of the Clos network is replaced with FPGAs to save board space (Fig. 11). Routability is worse than an ideal Clos network. Even with a true Clos network, complete routability of multipin nets is not guaranteed, which is an important practical consideration because in microelectronic design, many multipin nets typically exist.
In an attempt to solve the multipin net and routability problem, we can introduce extra connections among FPIDs as shown in Fig. 12. However, extra FPID interconnections also incur extra delay. We can also expand the fanout width of FPGAs so that each FPGA I/O pin is connected to more than one FPIC (60, 61). The fanout width expansion improves routability without significant additional delay. The multiple appearances of I/O pins increase the probability that a signal connection can be made in a single stage, which is especially critical for multipin nets. However, the additional fanouts increase the needed pin count of FPICs. Thus, we need to find a balanced fanout distribution that reduces the interconnect delay with a minimal pin requirement.
A treestructured network can simplify the mapping process for certain applications. In Reference (62), an example of a treestructured network is illustrated for a Very Large Scale Simulator (VLSS). The VLSS tree structure has all logic components located at the leaves and interconnect switches at the internal nodes. The machine covers a capacity of eight million gates. Each branch is an 8bit bus. The higher up the level of the tree, the less parallelism the signal distribution can achieve. Therefore, a partitioning process is designed to minimize the high level interconnect and maximize the parallel operation.
Time multiplexing is an effective method for tackling the scalability problem in interconnecting large designs. The timesharing method can be extended from traditional bus organization (27, 62) to network sharing (63) and further to function block sharing (40).
Interconnect can be time shared as a bus (27, 62). If n communication lines exist between two FPGAs, they can be reduced to a single line by storing logical outputs in shift registers and timemultiplexing the communication in phases. Such a scheme was employed in the virtual wires logic emulation system (27), which is efficient because interconnects are normally capable of being clocked at much higher rates than the critical path ofthe rest ofthe system, and all logical wires are not simultaneously active. This scheme can greatly reduce the complexity of the interconnecting network or printed circuit board in a multiFPGA system.
Li and Cheng (63) proposed that a dynamic network be viewed as overlapping L conventional FPICs together but sharing the same I/O pins. A dynamic routing architecture can increase the routability and shorten interconnect length. Each switching network is a full crossbar, which can be reconfigured to provide any connections among I/O pins. The select lines are used to activate only one switching network at a time; thus the I/O pins are dynamically connected according to the configuration of this active switching network. By dynamically reconfiguring the FPICs, L logic signals can timeshare the same interconnect resources.
Interconnect schemes should also consider how memory is connected to the FPGAs. Although combining memory with logic in the same FPGA is the most desirable method for reducing routing congestion and signal delay, separate components can supply much larger capacity at higher density and lower price. Figure 13 demonstrates three different ways of allocating the memories in a Clos network (31, 64). The memory may be attached directly to a local FPGA (Fig. 13a), attached to the secondstage switches ofthe Clos network via a host interface (Fig. 13b), or attached to the firststage switches ofthe Clos network (Fig. 13c). The first method provides good performance for local memory access. However, for the case of nonlocal memory access, the routability and delay are concerns. The second method is slower than the first method for local memory accesses but provides better routability. The third is the most flexible as the memory is attached to the network and the routabil — ity is high. However, every logictomemory communication must go through the second interconnect stage.
In FPGAs, signal propagation is inherently slow because of its programmable interconnect feature. However, the delay of long routing wires can be drastically reduced by buffer insertion. The principle at work is that by inserting buffers we can decouple capacitive effects of components and interconnect driven by the buffers and thereby improve RC delay.
Given a routing topology for a net and timing requirements for its sinks, an efficient optimal buffer insertion algorithm was proposed in Reference 65. Experimental results show dramatic improvement versus the unbuffered solution. Thus, it is advantageous to have abundant buffers in FPGAs. However, each possible buffer and its programmable switch adds capacitance to the wires, which in turn will contribute to delay. Thus, a balance point needs to be identified to trade off between the additional delay and capacitance of the buffers versus the improvement they can provide.
Figure 10. Folded Clos network. The threestage Clos network is folded into a twostage network so that the inputs and outputs are mixed in the first stage.
Figure 11. Variations of the Clos network. The first level crossbar of the Clos network is replaced with FPGAs to save board space. Routability is worse than an ideal Clos network.
FPGA 
FPGA 
FPGA 
FPGA 

FPID 
FPID 
FPID 
FPID 

Figure 12. Variations of Clos network. The fanout width of FPGAs is expanded so that each FPGA I/O pin is connected to more than one FPIC. The fanout width expansion improves routability without significant additional delay.
For a multisourced bus, the problem of buffer insertion becomes more complicated, because the optimization for one source may sacrifice the delay of others. Furthermore, the direction of the buffer needs to be arbitrated by a controller. Instead of using such a controller, a novel approach is to use a patented open collector bus repeater (66). When idle, the two ends of the repeater are set to high. When the repeater senses the pulldown action on one side, it presents the signal on the other side until the pulldown action is released from the originated signal. The bus repeater eliminates the need for a direction control signal, resulting in a simpler design and better use of resources.
System Decomposition
To decompose a system into multiple devices, Yeh et al. (67) proposed an algorithm based on the relationship between uniform multicommodity flow and mincut partitioning. Yeh et al. construct a flow network wherein each net initially corresponded an edge with flow cost one. Two random modules in the network were chosen and the shortest path (i. e., path with lowest cost) between them was computed. A constant A < 1 was added to the flow for each net in the shortest path, and the cost for every net in the path was incremented. Adjusting the cost penalizes paths through congested areas and forces alternative shortest paths. This random shortest path computation is repeated until every path between the chosen pair of modules passes through at least one “saturated” net. The set of saturated nets induces a multiway partitioning in which two modules belong to the same cluster if and only if there is a path of unsaturated nets between them.
For each of these clusters, the flux (defined as the cut — size between the cluster and its complement, divided by the size of the cluster) is computed and the clusters are sorted based on their flux value. Yeh et al. began with a single cluster equal to the entire netlist, and then peeled off the clusters with lowest flux. This approach was attractive because the saturated nets are good candidates to be cut in a partitioning solution. As peeled clusters can be very small, a second phase may be used to make the multiway partitioning more balanced. This approach, with its subsequent speedup by Yeh (68), is wellsuited for largescale multiway partitioning instances.
The system prototyping phase may also explore netlist transformations such as logic replication and retiming to minimize cut size (I/O usage) or system cycle time. Such transformations are needed as interdevice delays can be relatively large and because devices are often I/Olimited. In Reference (69), Liu et al. proposed a partitioning algorithm that permits logic replication to minimize both cut size and clock cycle of sequential circuits. Given a netlist G = (V, E), their approach chooses two modules as seeds s and t, then constructs a “replication graph” that is twice the size of the original circuit. This graph has the special property that a type of directed minimum cut yields the replication cut (i. e., a decomposition of V into S, T, and R = V — S — T where s є S, t є T and R is the replicated logic) that is optimal. A directed version of the Fiduccia — Mattheyses algorithm is used to find a heuristic directed minimum cut in the replication graph. Cong et al. (70) present an efficient algorithm for the performancedriven multiway circuit partitioning problem that considers the different local and global interconnect delay introduced by the partitioning.
Alpert and Kahng (71) survey the FPGA partitioning literature in the context of major graph partitioning paradigms. The current partitioning problems are 1) low usage rate of FPGA gate capacity because I/O pin limit, 2) low clock rate because of interconnect delay between multiple FPGAs and 3) long CPU time for the mapping process.
System Planning and Design Changes
For a given system decomposition to be implemented on a multiFPGA prototyping architecture, all connections within each device and between devices must be routable. Chan et al. (72) invoke much literature on routability prediction in gate arrays, as well as theoretical concepts, such as the Rent parameter, to obtain a fast routability estimate for arbitrary netlists and FPGA architectures. Their method ascribes one of three levels of routable (easily routable, marginally routable, or unroutable) to a netlist based on various parameters. Specifically, combining a wirelength estimator due to Feuer, the average number of pinspercell, and the estimated Rent parameter yields a relatively accurate routability predictor. The utility of these parameters is contrasted with that of other criteria such as El Gamal’s channel width requirement (73) or the
(a) 
Ґ > 
X 

Logic 
X 

X 

V J 
fe 
X: 
Ґ N 
X 

X 

Logic 
X 

v ) 
X 
outer stage
к ^ 

/ 

//V 

/ 

/ / > \ 

/X 

Figure 13. Memory organization, (a) Memory is attached directly to a local FPGA. (b) Memory is attached to the secondstage switches of the Clos network via a host interface, (c) Memory is attached to the firststage switches of the Clos network.
average pinspernet ratio.
In addition to routability, connections must also meet system timing constraints. Selvidge et al. (74) extend the original virtual wires (27) concept in their TIERS (TopologyIndEpendent Routing and Scheduling) approach. The problem formulation assumes that an assignment from a multipleFPGA partitioning (i. e., a design graph) to a target topology graph has already been made. The objective is to assign “links” (i. e., signal nets) to channels between devices; as with the Virtual Wires concept, specific timeslices for a channel can be assigned to multiple links as long as no two links need to transmit signals at the same time. The TIERS algorithm uses a greedy method to order the links and then routes each link in the scheduled order while reserving channel resources; factors of up to 2.5 improvement in system cycle time are achieved.
Chang, et al. (75) address the combined issues of routability and system timing by applying layoutdriven logic resynthesis techniques. For a given wire that cannot be routed, “alternative wires” and alternative functions are identified, such that the given unroutable wire can be removed from the circuit and replaced with a new wire (or wires) or new logic without affecting functionality. Cheng et al. estimate that between 30% and 50% of wires have so called “triplewire alternatives” (i. e., replacements consisting of three or fewer wires). Their method first routes the wires that do not have any alternatives then replaces any unroutable wire with available alternatives. System timing can be improved by replacing long wires with shorter alternatives.
CONCLUSION
Reconfigurable computing offers a middle ground between softwarebased systems and ASIC implementations, and is often able to combine important benefits of both. Implementations are able to avoid overheads such as unnecessary data transfers, decoding and control mandatory in microprocessors, and designs can be optimized on a basis specific to an application, a problem instance or even an execution. Using this technology, it is possible to achieve size, performance, cost, or power improvements over more conventional computing technologies.
ACKNOWLEDGMENTS
The authors would like to thank Y M. Lam for his help in preparing this manuscript and Prof. Wayne Luk (Imperial College) for his proofreading of this article.
[1] K OPERATION
The relatively low Tc of NbTi allows means that considerable gains in Hc2 and and Jc can be gained from lowering the temperature of operation using superfluid liquid He, typically at 1.8 to 2 K. This has been recently exploited by the Large Hadron Collider project at CERN, which operates at 1.9 K so that the magnetic field can be pushed beyond 8 T (71). Boutboul et al. (72) have shown
[2] Mechanical breaker (air, air blast, vacuum, vacuum and magnetic field)
[3] Water cooled fuses (activated by water flow interruption)
[4] Explosively actuated breaker (fuse and fuseless)
[5]3 < Cf (Tc — Tb )[1 + f (ht, K, C, p, • . •)]
[6] “Thermal” and “nonthermal ” are used inconsistently in the bio
effects literature, at times referring to mechanism for producing an effect, and at times (even in the same paper) referring to the presence or absence of what the investigator considers to be a significant change in temperature. Some investigators use the term athermal to indicate a biological effect that occurs in the absence of noticeable heating, but which is still produced by thermal mechanisms. An example of an athermal effect might be a physiological response caused by thermoregulation, although no noticeable change in body temperature occurs.
[8] A walk in a graph G = (V, E) is a sequence of nodes w = [v1,v2,.. .vk], k > 1, such that ( j, j + 1) є E, j = 1, 2,.. .k — 1. A walk is closed if k > 1 and v1 = vk.
Добавить комментарий