EURASIP Journal on Applied Signal Processing
Table 4: Architecture tradeoff exploration for covariance estimation module.
Cycles | 1 | 2   | 3   | 4-8 | 9  | 10 |
MU(a)  | 0 | 176 | 0   | 0   | 0  | 0  |
AD(a)  | 0 | 0   | 136 | 0   | 0  | 0  |
MU(b)  | 0 | 22  | 22  | 22  | 22 | 0  |
AD(b)  | 0 | 0   | 17  | 17  | 17 | 17 |
Table 5: Architecture efficiency comparison for Catapult C versus Xilinx IP core.
Architecture    | Mult | Cycles | Slices |
Xilinx core     | 12   | 128    | 2066   |
Catapult C Sol1 | 8    | 570    | 535    |
Catapult C Sol2 | 2    | 625    | 543    |
Catapult C Sol3 | 1    | 810    | 551    |
(ACR). After the word length is adjusted by shifting, a
separate parallel-read-shuffle (PRS) module designed with
Catapult C reads the registers in parallel for [E_0, ..., E_L],
writes them to memory, and shuffles the Hermitian part
[E_L^H, ..., E_1^H]. Memory stalls are avoided, and scalability
is achieved because the module can stop at any chip to adjust
to different update rates.
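The shuffle of the Hermitian part described above can be illustrated with a minimal software sketch. It assumes each block E_l is a small complex matrix stored as a list of rows; the function names `hermitian` and `shuffle_hermitian` and the sample values are hypothetical and are not taken from the PRS module itself.

```python
def hermitian(block):
    """Return the conjugate transpose of a matrix stored as a list of rows."""
    rows, cols = len(block), len(block[0])
    return [[block[r][c].conjugate() for r in range(rows)] for c in range(cols)]


def shuffle_hermitian(blocks):
    """Given [E_0, ..., E_L], return the mirrored part [E_L^H, ..., E_1^H]."""
    # E_0 is skipped: only the blocks with lag 1..L are mirrored.
    return [hermitian(b) for b in reversed(blocks[1:])]


# Illustrative (made-up) 2 x 2 blocks for L = 2.
E = [
    [[1 + 0j, 2 - 1j], [2 + 1j, 3 + 0j]],  # E_0
    [[0 + 1j, 1 + 1j], [2 + 0j, 0 - 1j]],  # E_1
    [[1 + 2j, 0 + 0j], [3 - 1j, 1 + 0j]],  # E_2
]
mirrored = shuffle_hermitian(E)
```

In hardware the same reordering would be done by address generation while the registers are read in parallel; the sketch only shows which block ends up where.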
In the PMS, the number of FUs is assigned according
to the time/area constraints. As an example, for a (2 × 2)
case with L = 10, the VLSI area/time tradeoff is shown in
Table 4. The complexity is 176 multiplications and 136
additions in each computation period. A typical manual
design would lay out 176 multipliers and 136 adders, all in
parallel. This takes 4 cycles to complete the computation;
however, the multipliers then sit idle for 9 cycles and are
wasted. At the other extreme, an area-constrained solution
reuses one multiplier and one adder but takes more than 176
cycles. The most area/time-efficient architecture for a
10-cycle period reuses 22 multipliers and 17 adders (Table 4)
in pipelined operation. Multiplexing so many multipliers in a
manual RTL layout could be very difficult and time consuming.
Moreover, for a changed specification such as the chip rate or
clock rate, we can rapidly reschedule the design to meet the
real-time requirement with the minimum hardware resources.
A similar design method is applied to the FIR and
channel estimation.
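The FU counts in Table 4 follow from simple arithmetic, sketched below. The sketch assumes, as read from Table 4, that in solution (b) the multipliers are active in cycles 2 to 9 and the adders in cycles 3 to 10, that is, 8 active cycles each within the 10-cycle period. The helper names `min_units` and `utilization` are hypothetical.

```python
def min_units(num_ops, active_cycles):
    """Smallest number of functional units that finishes num_ops in active_cycles."""
    return (num_ops + active_cycles - 1) // active_cycles  # integer ceiling


def utilization(num_ops, units, period):
    """Fraction of unit-cycles doing useful work over one computation period."""
    return num_ops / (units * period)


PERIOD = 10             # cycles per computation period (Table 4)
MULTS, ADDS = 176, 136  # operations per period, from the text

# Solution (a): everything in parallel, one active cycle per unit type.
mu_a, ad_a = min_units(MULTS, 1), min_units(ADDS, 1)

# Solution (b): pipelined reuse over 8 active cycles (assumed from Table 4).
mu_b, ad_b = min_units(MULTS, 8), min_units(ADDS, 8)

util_a = utilization(MULTS, mu_a, PERIOD)
util_b = utilization(MULTS, mu_b, PERIOD)
```

Under these assumptions the multiplier utilization rises from 10% in the fully parallel layout to 80% in the pipelined one, which is the waste the text refers to.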
6.2.2. Block-based MIMO-FFT IP cores
For the multiple FFTs in the tap solver, the keys to optimizing
area/speed are loop unrolling, pipelining, and resource
multiplexing. Although Xilinx provides FFT IP cores,
they are considerably larger and much faster than required.
For example, a single v32FFT core in the Xilinx CoreGen library
utilizes 12 multipliers and 2066 slices. Moreover, it is not
easy to exploit the commonality among the MIMO-FFTs when
using the IP core. To achieve the best area/time tradeoff in
different situations, we design customized MIMO-FFT modules
that exploit the commonality in control logic and phase
coefficient loading. Parallelism and pipelining in the parallel
FFTs are studied extensively at multiple levels, for example,
the BFU level, the stage level, and the FFT-processor level.
Catapult C scheduled RTLs for 32-point FFTs with 16 bits are
compared with the Xilinx v32FFT core in Table 5 for a single FFT.
The Catapult C designs are much smaller across the different
solutions, for example, from solution 1 with 8 multipliers and
535 slices to solution 3 with only one multiplier and
551 slices. Overall, solution 3 represents the smallest design
in terms of multipliers, with slower but acceptable speed for
a single FFT. For the MIMO-FFT/IFFT modules, we can reuse the
control logic inside the FFT module and schedule the number of
FUs more efficiently in the merged mode.

Table 6: The area/time specification of the major FPGA design cores.
Architecture        | Latency  | CLB     | ASIC Mult |
Correlator          | 1 chip   | 22399   | 80        |
16-FFT32            | 43.1 μs  | 2530    | 4         |
32 MatInvMult(4 × 4) | 37.6 μs  | 4526    | 6         |
16-IFFT32           | 43.1 μs  | 2530    | 4         |
Overall tap solver  | 123.8 μs | 7109. 3 | 14        |
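To make the idea of resource multiplexing at the BFU level concrete, the following is a minimal software sketch of a textbook iterative radix-2 decimation-in-time FFT in which every butterfly goes through one shared `butterfly` function. It is not the Catapult C source and says nothing about the cycle counts in Table 5; the function names and the test tone are assumptions made for illustration.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def butterfly(a, b, w):
    """Single radix-2 butterfly: one complex multiply, two complex adds."""
    t = w * b
    return a + t, a - t


def fft_radix2(x):
    """Iterative radix-2 DIT FFT; returns (spectrum, number of butterflies)."""
    n = len(x)
    bits = n.bit_length() - 1
    assert 1 << bits == n, "length must be a power of two"
    data = [x[bit_reverse(i, bits)] for i in range(n)]
    count = 0
    half = 1
    while half < n:  # one pass per stage
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / (2 * half))  # twiddle factor
                i, j = start + k, start + k + half
                data[i], data[j] = butterfly(data[i], data[j], w)
                count += 1
        half *= 2
    return data, count


# A complex tone at bin 3 of a 32-point transform.
samples = [cmath.exp(2j * cmath.pi * 3 * n / 32) for n in range(32)]
spectrum, num_butterflies = fft_radix2(samples)
```

A 32-point transform has 5 stages of 16 butterflies each. In a hardware schedule, unrolling the inner loops corresponds to instantiating more butterfly units, while keeping them rolled corresponds to time-multiplexing a few units, which is the tradeoff the three Catapult C solutions explore.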
6.3. Prototyping implementation
Based on the above algorithmic and architectural optimizations,
we have prototyped the VLSI architecture of a (4 × 4)
MIMO equalizer on the Aptix FPGA platform [27]. The
correlation window is set to 10 chips for all 4 receive
antennas. Fixed-point simulation shows that an 8-bit input chip
word length gives negligible performance loss. To provide a
safety margin, the input chip samples to both the correlator and
the channel estimator have 10-bit precision. The 32-point
MIMO-FFT module has a 16-bit input word length for both the
covariance and channel coefficients. To support even faster
fading, we design the prototyping system for up to 4 updates per
slot, with an overall tap-solving latency requirement of 125
microseconds. Table 6 gives the specification of the major
design blocks. Overall, we utilize only 4 multipliers to
achieve an area/time-efficient design for the 16 merged FFT/IFFT
modules. For the LF inverse of the (4 × 4) Hermitian
submatrices, the latency is 38 microseconds with 6 multipliers.
The different modules also have very similar latencies, which
gives well-balanced pipelining across the stages. The overall
latency of 124 microseconds meets the real-time requirement very
closely while remaining area efficient. This efficiency benefits
not only from the aforementioned algorithmic and architectural
optimizations, but also from the extensive design space
exploration used to find the most compact design that still
meets the real-time requirement. The integration of the MIMO
equalizer into the complete HSDPA transceiver system, following
the same methodology as in [13], is also being considered.
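The latency budget quoted above can be checked directly from the Table 6 entries. The short sketch below only restates that arithmetic; the variable names are hypothetical, and the three tap-solver stages are assumed to run back to back so that their latencies add.

```python
# Stage latencies in microseconds, copied from Table 6.
stage_latency_us = {
    "16-FFT32": 43.1,
    "32 MatInvMult(4x4)": 37.6,
    "16-IFFT32": 43.1,
}
stage_multipliers = {"16-FFT32": 4, "32 MatInvMult(4x4)": 6, "16-IFFT32": 4}

BUDGET_US = 125.0  # tap-solving latency requirement stated in the text

total_us = sum(stage_latency_us.values())
slack_us = BUDGET_US - total_us
total_multipliers = sum(stage_multipliers.values())

# Balance of the pipeline: ratio of slowest to fastest stage.
balance = max(stage_latency_us.values()) / min(stage_latency_us.values())
```

The sum reproduces the 123.8 μs and 14 multipliers listed for the overall tap solver, leaving roughly 1.2 μs of slack against the 125 μs requirement.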