Reconfigurable computing has found widespread application in the form of “custom computing machines” for high-energy physics (19), genome analysis (20), signal processing (21,22), cryptography (7,23), financial engineering (24), and other domains (25). It is unique in that the flexibility of the fabric allows customization to a degree not feasible in an ASIC. For example, in an FPGA-based implementation of RSA cryptography (23), a different hardware modular multiplier was employed for each prime modulus (i.e., the modulus was hardwired into the logic equations of the design). Such an approach would not be practical in an ASIC, as the design effort and cost are too high to develop a different chip for different moduli. Hardwiring the modulus greatly reduced the hardware required and improved performance, the implementation being an order of magnitude faster than any reported implementation in any technology at the time.
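The benefit of specializing a multiplier for one modulus can be sketched in software. The following is an illustrative example (not the design from Reference (23)): a Montgomery multiplier whose per-modulus constants are precomputed once, analogous to folding the modulus into the FPGA's logic equations. The function names and the parameter `k` are hypothetical.

```python
# Sketch: Montgomery multiplication with constants precomputed for one
# fixed odd modulus m, illustrating modulus specialization. Illustrative
# only; not the circuit of Reference (23).

def make_montgomery(m, k=16):
    R = 1 << k                       # R = 2^k, chosen so that R > m
    assert m % 2 == 1 and m < R
    m_inv = pow(-m, -1, R)           # m' = -m^{-1} mod R, computed once per modulus

    def redc(t):                     # Montgomery reduction: t * R^{-1} mod m
        u = (t + ((t * m_inv) & (R - 1)) * m) >> k
        return u - m if u >= m else u

    def modmul(a, b):                # a, b in Montgomery form (x * R mod m)
        return redc(a * b)

    to_mont = lambda x: (x << k) % m # convert into Montgomery form
    from_mont = redc                 # convert back (multiply by R^{-1})
    return modmul, to_mont, from_mont
```

Because `R` is a power of two, the reduction needs only shifts, masks, and adds; with `m` and `m'` fixed, a hardware version reduces to constant-coefficient logic, which is the source of the area and speed savings described above.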

Another important application is logic emulation (26,27), where reconfigurable computing is used not only for simulation acceleration, but also for prototyping of ASICs and in-circuit emulation. In-circuit emulation allows the possibility of testing prototypes at full or near-full speed, allowing more thorough testing of time-dependent applications such as networks. It also removes many of the dependencies between ASIC and firmware development, allowing them to proceed in parallel and hence shortening development time. As an example, it was used in Reference (28) for the development of a two-million-gate ASIC containing an IEEE 802.11 medium access controller and IEEE 802.11a/b/g physical layer processor. Using a reconfigurable prototype of the ASIC on a commodity FPGA board, the ASIC went through one complete pass of real-time beta testing before tape-out.

Digital logic, of course, maps extremely well to fine-grained FPGA devices. The main design issues for such systems lie in partitioning a design among multiple FPGAs and dealing with the interconnect bottleneck between chips. The Cadence Palladium II emulator (29) is a commercial example of a logic emulation system and has 256-million-gate logic capacity and 74-GB memory capacity. It uses custom ASICs optimized for logic emulation and is 100 to 10,000 times faster than software-based register-transfer-level (RTL) simulation. Further discussion of interconnect time-multiplexing and system decomposition is given later in this article.
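The interconnect bottleneck arises because a partitioned design typically has far more logical signals crossing chip boundaries than there are physical pins. A common remedy is to time-multiplex the pins. The following is a simplified model of that idea (illustrative only; it does not describe Palladium's actual scheme): N logical signals share P pins by being serialized over ceil(N/P) phases per emulation cycle.

```python
# Sketch: time-multiplexing N logical inter-FPGA signals over P physical
# pins. A simplified model of pin multiplexing, not any vendor's scheme.
import math

def schedule(num_signals, num_pins):
    """Assign each logical signal to a (phase, pin) slot."""
    phases = math.ceil(num_signals / num_pins)
    return [list(range(p * num_pins, min((p + 1) * num_pins, num_signals)))
            for p in range(phases)]

def transmit(values, num_pins):
    """Serialize the signal values phase by phase and reassemble them."""
    received = {}
    for group in schedule(len(values), num_pins):
        for sig in group:            # within a phase, one pin per signal
            received[sig] = values[sig]
    return [received[i] for i in range(len(values))]
```

The cost is visible in the phase count: 300 boundary signals over 64 pins need 5 phases, so the emulated clock runs at roughly one-fifth of the board-level transfer rate. This trade-off between pin count and emulation speed is central to the system decomposition discussed later.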

Hoang (20) implemented algorithms to find minimum edit distances for protein and DNA sequences on the Splash 2 architecture. Splash 2 can be modeled in terms of both bidirectional and unidirectional systolic arrays. In the bidirectional algorithm, the source character stream is fed to the leftmost processing element (PE), whereas the target stream is fed to the rightmost PE. Comparing two sequences of length m and n requires at least 2 × max(m + 1, n + 1) processors, and the number of steps required to compute the edit distance is proportional to the size of the array. The unidirectional algorithm is suited for comparing a single source sequence against multiple target sequences. The source sequence is first loaded as in the bidirectional case, and the target sequences are fed in one after the other and processed as they pass through the PEs. This results in virtually 100% utilization of the processors, so the unidirectional model is better suited for large database searches.
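The parallelism a systolic array exploits can be seen in the structure of the edit-distance recurrence: all cells on one anti-diagonal of the dynamic-programming table depend only on the previous two anti-diagonals, so each PE can update its cell concurrently. The sketch below evaluates the table diagonal by diagonal to mirror that schedule in spirit; it is illustrative and is not Hoang's Splash 2 implementation.

```python
# Sketch: edit-distance DP swept anti-diagonal by anti-diagonal. Cells on
# one anti-diagonal are mutually independent, so a systolic array can
# compute each diagonal in a single parallel step (illustrative only).

def edit_distance(src, tgt):
    m, n = len(src), len(tgt)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):           # deleting i characters of src
        D[i][0] = i
    for j in range(n + 1):           # inserting j characters of tgt
        D[0][j] = j
    # Sweep anti-diagonals d = i + j; in hardware, every cell on a
    # diagonal would be updated in the same step by its own PE.
    for d in range(2, m + n + 1):
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + cost)  # match/substitution
    return D[m][n]
```

Since each diagonal finishes in one step, the total step count grows with the array length rather than with the m × n table size, consistent with the proportionality noted above.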

The BEE2 system (22), described in the next section, was applied to the radio astronomy signal processing domain, which included development of a billion-channel spectrometer, a 1024-channel polyphase filter bank, and a two-input, 1024-channel correlator. The FPGA-based system used a 130-nm technology FPGA, and performance was compared with 130- and 90-nm DSP chips as well as a 90-nm microprocessor. Computational throughput per chip was found to be 10 to 34 times that of the 130-nm DSP chip and 4 to 13 times better than the microprocessor. In terms of power efficiency, the FPGA was one order of magnitude better than the DSP and two orders of magnitude better than the microprocessor. Compute throughput per unit chip cost was 20-307% better than the 90-nm DSP and 50-500% better than the microprocessor.
