MULTICHIP SYSTEMS

Special care must be taken in the design of large and mul­tichip reconfigurable systems. In this section, we describe some theoretic results relevant to the major architectural and issues associated with such designs.

Interconnect Organization

Figure 8. Hardware compilation example. The C program is translated into a datapath (top) and control (bottom). Execution of statements in the while loop are controlled by s1 and s2; s0 and s3 correspond to the start signals of the statements before and after the while loop.

Figure 9. Clos network. A Clos network contains three stages: inputs, intermediate switches, and outputs. The input and output stages are symmetric. In the figure, the first-stage has r n x m switches, the second-stage has mr x r switches, and the third-stage has r m x n switches.

A classic Clos network (57) contains three stages: inputs, intermediate switches, and outputs, as shown in Fig. 9. It can be used to interconnect pins in a reconfigurable com­puting system, and its input and output stages are symmet­ric. Suppose the first stage has r nx m crossbar switches, the second stage has m r x r switches, and the third stage has rmx n switches, let us denote the network as c(n, m,r). For any two-pin net interconnect requirement, the network c(n, m, r) can achieve complete routability if m is not less than n. The routing method can be described by recursive operations (58). In the first iteration, we reduce the net­work to c(n — 1, m — 1, r). In the ith iteration, we reduce the network to c(n — i, m — i, r). When n — і = 1, we have r 1 x (m — n + 1) switches in the first stage, m — n + 1 r x r
switches in the second stage, and r(m — n + 1) x 1 switches in the third stage. In other words, only one input exists in each first-stage switch and one output in each third-stage switch. In this case, one second-stage r x r switch is enough to route the r inputs of r first-stage switches to the r outputs of r third-stage switches, thus completing the interconnect.

The reduction from c(n — i, m — i, r) to c(n — і — 1,m — і — 1,r) can be derived by a maximum matching algorithm. The matching algorithm selects disjoint signals from dif­ferent input switches to different output switches. One second-stage switch is then used to route the selected sig­nals. From Hall’s theorem, the maximum matching and

routing can always reduce the network to c(n — і — 1,m —

і — 1, r).

Conceptually, the routing problem can also be formu­lated as edge coloring on a bipartite graph G(V1, V2, E) (31). The node sets V1 and V2 represent the switches in the input and output stages, respectively. An edge in E represents a two-pin net interconnect requirement between the cor­responding input and output switches. In Reference (31), Chan and Schlag assigned colors to the edges of the bi­partite graph. Edges of the same color are bundled into one group and the corresponding set of nets are routed by one switch in the second stage. The work of Reference (59) was then used to find a minimum edge coloring solution in O(|E|logn).

The three-stage Clos network can be folded into a two — stage network (Fig. 10) so that the inputs and outputs are mixed in the first stage. Thus, the corresponding bipartite graph G(Vi, V2, E) constructed above for edge coloring is also folded with V1 and V2 merged into one set.

To find the routing assignment, the folded edge color­ing graph can be unfolded back to a bipartite graph using an Euler path search. The Euler path traverses every edge exactly once and defines the edge direction according to the direction of the traversal. We then recover the origi­nal bipartite graph by splitting the node set back into two sets V1 and V2 and unfold the edges such that all edges are directed from V1 to V2. We can find the minimum edge coloring solution of the unfolded bipartite graph and apply the solution back to the folded routing problem.

In practice, the first-level crossbar of the Clos network is replaced with FPGAs to save board space (Fig. 11). Routability is worse than an ideal Clos network. Even with a true Clos network, complete routability of multipin nets is not guaranteed, which is an important practical consid­eration because in microelectronic design, many multipin nets typically exist.

In an attempt to solve the multipin net and routability problem, we can introduce extra connections among FPIDs as shown in Fig. 12. However, extra FPID interconnections also incur extra delay. We can also expand the fanout width of FPGAs so that each FPGA I/O pin is connected to more than one FPIC (60, 61). The fanout width expansion im­proves routability without significant additional delay. The multiple appearances of I/O pins increase the probability that a signal connection can be made in a single stage, which is especially critical for multipin nets. However, the additional fanouts increase the needed pin count of FPICs. Thus, we need to find a balanced fanout distribution that reduces the interconnect delay with a minimal pin require­ment.

A tree-structured network can simplify the mapping process for certain applications. In Reference (62), an ex­ample of a tree-structured network is illustrated for a Very Large Scale Simulator (VLSS). The VLSS tree structure has all logic components located at the leaves and intercon­nect switches at the internal nodes. The machine covers a capacity of eight million gates. Each branch is an 8-bit bus. The higher up the level of the tree, the less parallelism the signal distribution can achieve. Therefore, a partitioning process is designed to minimize the high level interconnect and maximize the parallel operation.

Interconnect Multiplexing

Time multiplexing is an effective method for tackling the scalability problem in interconnecting large designs. The time-sharing method can be extended from traditional bus organization (27, 62) to network sharing (63) and further to function block sharing (40).

Interconnect can be time shared as a bus (27, 62). If n communication lines exist between two FPGAs, they can be reduced to a single line by storing logical outputs in shift registers and time-multiplexing the communication in phases. Such a scheme was employed in the virtual wires logic emulation system (27), which is efficient because in­terconnects are normally capable of being clocked at much higher rates than the critical path ofthe rest ofthe system, and all logical wires are not simultaneously active. This scheme can greatly reduce the complexity of the intercon­necting network or printed circuit board in a multi-FPGA system.

Li and Cheng (63) proposed that a dynamic network be viewed as overlapping L conventional FPICs together but sharing the same I/O pins. A dynamic routing architec­ture can increase the routability and shorten interconnect length. Each switching network is a full crossbar, which can be reconfigured to provide any connections among I/O pins. The select lines are used to activate only one switching net­work at a time; thus the I/O pins are dynamically connected according to the configuration of this active switching net­work. By dynamically reconfiguring the FPICs, L logic sig­nals can time-share the same interconnect resources.

Memory Allocation

Interconnect schemes should also consider how memory is connected to the FPGAs. Although combining memory with logic in the same FPGA is the most desirable method for reducing routing congestion and signal delay, separate components can supply much larger capacity at higher den­sity and lower price. Figure 13 demonstrates three differ­ent ways of allocating the memories in a Clos network (31, 64). The memory may be attached directly to a local FPGA (Fig. 13a), attached to the second-stage switches ofthe Clos network via a host interface (Fig. 13b), or attached to the first-stage switches ofthe Clos network (Fig. 13c). The first method provides good performance for local memory ac­cess. However, for the case of nonlocal memory access, the routability and delay are concerns. The second method is slower than the first method for local memory accesses but provides better routability. The third is the most flexible as the memory is attached to the network and the routabil — ity is high. However, every logic-to-memory communication must go through the second interconnect stage.

Bus Buffer Insertion

In FPGAs, signal propagation is inherently slow because of its programmable interconnect feature. However, the delay of long routing wires can be drastically reduced by buffer insertion. The principle at work is that by inserting buffers we can decouple capacitive effects of components and in­terconnect driven by the buffers and thereby improve RC delay.

Given a routing topology for a net and timing require­ments for its sinks, an efficient optimal buffer insertion algorithm was proposed in Reference 65. Experimental re­sults show dramatic improvement versus the unbuffered solution. Thus, it is advantageous to have abundant buffers in FPGAs. However, each possible buffer and its pro­grammable switch adds capacitance to the wires, which in turn will contribute to delay. Thus, a balance point needs to be identified to trade off between the additional delay and capacitance of the buffers versus the improvement they can provide.

Figure 10. Folded Clos network. The three-stage Clos network is folded into a two-stage network so that the inputs and outputs are mixed in the first stage.

Figure 11. Variations of the Clos network. The first level crossbar of the Clos network is replaced with FPGAs to save board space. Routability is worse than an ideal Clos network.

FPGA

FPGA

FPGA

FPGA

FPID

FPID

FPID

FPID

Figure 12. Variations of Clos network. The fanout width of FPGAs is expanded so that each FPGA I/O pin is connected to more than one FPIC. The fanout width expansion improves routability without significant additional delay.

For a multisourced bus, the problem of buffer insertion becomes more complicated, because the optimization for one source may sacrifice the delay of others. Furthermore, the direction of the buffer needs to be arbitrated by a con­troller. Instead of using such a controller, a novel approach is to use a patented open collector bus repeater (66). When idle, the two ends of the repeater are set to high. When the repeater senses the pull-down action on one side, it presents the signal on the other side until the pull-down action is released from the originated signal. The bus re­peater eliminates the need for a direction control signal, resulting in a simpler design and better use of resources.

System Decomposition

To decompose a system into multiple devices, Yeh et al. (67) proposed an algorithm based on the relationship between uniform multi-commodity flow and min-cut partitioning. Yeh et al. construct a flow network wherein each net ini­tially corresponded an edge with flow cost one. Two random modules in the network were chosen and the shortest path (i. e., path with lowest cost) between them was computed. A constant A < 1 was added to the flow for each net in the shortest path, and the cost for every net in the path was incremented. Adjusting the cost penalizes paths through congested areas and forces alternative shortest paths. This random shortest path computation is repeated until every path between the chosen pair of modules passes through at least one “saturated” net. The set of saturated nets induces a multi-way partitioning in which two modules belong to the same cluster if and only if there is a path of unsaturated nets between them.

For each of these clusters, the flux (defined as the cut — size between the cluster and its complement, divided by the size of the cluster) is computed and the clusters are sorted based on their flux value. Yeh et al. began with a single cluster equal to the entire netlist, and then peeled off the clusters with lowest flux. This approach was attrac­tive because the saturated nets are good candidates to be cut in a partitioning solution. As peeled clusters can be very small, a second phase may be used to make the multi-way partitioning more balanced. This approach, with its sub­sequent speedup by Yeh (68), is well-suited for large-scale multi-way partitioning instances.

The system prototyping phase may also explore netlist transformations such as logic replication and retiming to minimize cut size (I/O usage) or system cycle time. Such transformations are needed as inter-device delays can be relatively large and because devices are often I/O-limited. In Reference (69), Liu et al. proposed a partitioning algo­rithm that permits logic replication to minimize both cut size and clock cycle of sequential circuits. Given a netlist G = (V, E), their approach chooses two modules as seeds s and t, then constructs a “replication graph” that is twice the size of the original circuit. This graph has the spe­cial property that a type of directed minimum cut yields the replication cut (i. e., a decomposition of V into S, T, and R = V — S — T where s є S, t є T and R is the replicated logic) that is optimal. A directed version of the Fiduccia — Mattheyses algorithm is used to find a heuristic directed minimum cut in the replication graph. Cong et al. (70) present an efficient algorithm for the performance-driven multi-way circuit partitioning problem that considers the different local and global interconnect delay introduced by the partitioning.

Alpert and Kahng (71) survey the FPGA partition­ing literature in the context of major graph partitioning paradigms. The current partitioning problems are 1) low usage rate of FPGA gate capacity because I/O pin limit, 2) low clock rate because of interconnect delay between mul­tiple FPGAs and 3) long CPU time for the mapping process.

System Planning and Design Changes

For a given system decomposition to be implemented on a multi-FPGA prototyping architecture, all connections within each device and between devices must be routable. Chan et al. (72) invoke much literature on routability pre­diction in gate arrays, as well as theoretical concepts, such as the Rent parameter, to obtain a fast routability esti­mate for arbitrary netlists and FPGA architectures. Their method ascribes one of three levels of routable (easily routable, marginally routable, or unroutable) to a netlist based on various parameters. Specifically, combining a wirelength estimator due to Feuer, the average number of pins-per-cell, and the estimated Rent parameter yields a relatively accurate routability predictor. The utility of these parameters is contrasted with that of other criteria such as El Gamal’s channel width requirement (73) or the

(a)

Ґ >

X

Logic

X

X

V J

fe

X:

Ґ N

X

X

Logic

X

v )

X

outer stage

к ^

/

//V

/

/ / > \

/X|

Figure 13. Memory organization, (a) Memory is attached directly to a local FPGA. (b) Memory is attached to the second-stage switches of the Clos network via a host interface, (c) Memory is attached to the first-stage switches of the Clos network.

average pins-per-net ratio.

In addition to routability, connections must also meet system timing constraints. Selvidge et al. (74) ex­tend the original virtual wires (27) concept in their TIERS (Topology-IndEpendent Routing and Scheduling) approach. The problem formulation assumes that an as­signment from a multiple-FPGA partitioning (i. e., a design graph) to a target topology graph has already been made. The objective is to assign “links” (i. e., signal nets) to chan­nels between devices; as with the Virtual Wires concept, specific timeslices for a channel can be assigned to multi­ple links as long as no two links need to transmit signals at the same time. The TIERS algorithm uses a greedy method to order the links and then routes each link in the sched­uled order while reserving channel resources; factors of up to 2.5 improvement in system cycle time are achieved.

Chang, et al. (75) address the combined issues of routability and system timing by applying layout-driven logic resynthesis techniques. For a given wire that cannot be routed, “alternative wires” and alternative functions are identified, such that the given unroutable wire can be re­moved from the circuit and replaced with a new wire (or wires) or new logic without affecting functionality. Cheng et al. estimate that between 30% and 50% of wires have so- called “triple-wire alternatives” (i. e., replacements consist­ing of three or fewer wires). Their method first routes the wires that do not have any alternatives then replaces any unroutable wire with available alternatives. System tim­ing can be improved by replacing long wires with shorter alternatives.

CONCLUSION

Reconfigurable computing offers a middle ground between software-based systems and ASIC implementations, and is often able to combine important benefits of both. Im­plementations are able to avoid overheads such as unnec­essary data transfers, decoding and control mandatory in microprocessors, and designs can be optimized on a basis specific to an application, a problem instance or even an execution. Using this technology, it is possible to achieve size, performance, cost, or power improvements over more conventional computing technologies.

ACKNOWLEDGMENTS

The authors would like to thank Y M. Lam for his help in preparing this manuscript and Prof. Wayne Luk (Imperial College) for his proofreading of this article.

[1] K OPERATION

The relatively low Tc of Nb-Ti allows means that consid­erable gains in Hc2 and and Jc can be gained from low­ering the temperature of operation using superfluid liq­uid He, typically at 1.8 to 2 K. This has been recently exploited by the Large Hadron Collider project at CERN, which operates at 1.9 K so that the magnetic field can be pushed beyond 8 T (71). Boutboul et al. (72) have shown

[2] Mechanical breaker (air, air blast, vacuum, vacuum and magnetic field)

[3] Water cooled fuses (activated by water flow interrup­tion)

[4] Explosively actuated breaker (fuse and fuseless)

[5]3 < Cf (Tc — Tb )[1 + f (ht, K, C, p, • . •)]

[6] “Thermal” and “nonthermal ” are used inconsistently in the bio­

effects literature, at times referring to mechanism for producing an effect, and at times (even in the same paper) referring to the presence or absence of what the investigator considers to be a sig­nificant change in temperature. Some investigators use the term athermal to indicate a biological effect that occurs in the absence of noticeable heating, but which is still produced by thermal mech­anisms. An example of an athermal effect might be a physiologi­cal response caused by thermoregulation, although no noticeable change in body temperature occurs.

[8] A walk in a graph G = (V, E) is a sequence of nodes w = [v1,v2,.. .vk], k > 1, such that ( j, j + 1) є E, j = 1, 2,.. .k — 1. A walk is closed if k > 1 and v1 = vk.

Добавить комментарий

Ваш e-mail не будет опубликован. Обязательные поля помечены *