An Efficient Circulant MIMO Equalizer for CDMA Downlink: Algorithm and VLSI Architecture





EURASIP Journal on Applied Signal Processing

Table 4: Architecture tradeoff exploration for covariance estimation module.

Cycles    1     2     3    4-8    9    10
MU(a)     0   176     0     0     0     0
AD(a)     0     0   136     0     0     0
MU(b)     0    22    22    22    22     0
AD(b)     0     0    17    17    17    17

Table 5: Architecture efficiency comparison for Catapult C versus Xilinx IP core.

Architecture       mult   Cycles   Slices
Xilinx core          12      128     2066
Catapult C Sol1       8      570      535
Catapult C Sol2       2      625      543
Catapult C Sol3       1      810      551

(ACR). After the word length is adjusted by shifting, a separate
parallel-read-shuffle (PRS) module designed by Catapult C reads the
registers in parallel for [E_0, ..., E_L], writes them to the memory,
and shuffles the Hermitian part [E_L^H, ..., E_1^H]. Memory stalls are
avoided and scalability is achieved because it can stop at any chip to
adjust to different update rates.
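
As a rough illustration of the shuffle step, the sketch below derives the Hermitian part [E_L^H, ..., E_1^H] from the blocks [E_0, ..., E_L]. The function names and the list-of-rows matrix layout are hypothetical choices made for this example; the actual register and memory organization of the PRS module is not described here.

```python
def hermitian(block):
    """Conjugate transpose of a matrix stored as a list of rows."""
    rows, cols = len(block), len(block[0])
    return [[block[r][c].conjugate() for r in range(rows)] for c in range(cols)]


def shuffle_hermitian_part(blocks):
    """Given [E_0, ..., E_L], return [E_L^H, ..., E_1^H].

    E_0 is skipped because only lags 1..L appear in the Hermitian part.
    The reversed order mirrors the listing [E_L^H, ..., E_1^H] in the text.
    """
    return [hermitian(e) for e in reversed(blocks[1:])]


# Example with L = 2 and (2 x 2) blocks (values are made up for illustration).
E0 = [[2 + 0j, 1 - 1j], [1 + 1j, 3 + 0j]]
E1 = [[1 + 2j, 0 + 1j], [2 - 1j, 1 + 0j]]
E2 = [[0 + 1j, 1 + 0j], [3 + 0j, 0 - 2j]]
hermitian_part = shuffle_hermitian_part([E0, E1, E2])
```

In this toy layout the full lag sequence would be `hermitian_part + [E0, E1, E2]`, although the ordering actually used in memory is an assumption.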

In the PMS, the number of FUs is assigned according to the time/area
constraints. As an example for a (2 × 2) case with L = 10, the VLSI
area/time tradeoff is shown in Table 4. The complexity is 176
multiplications and 136 additions in each computation period. A typical
manual design will lay out 176 multipliers and 136 adders all in
parallel. This will take 4 cycles to complete the computation. However,
the multipliers are then idle for 9 cycles and are wasted. At the other
extreme, an area-constrained solution will reuse one multiplier and one
adder, but has to take more than 176 cycles. The most area/time-efficient
architecture in 10 cycles is to reuse 22 multipliers and 17 adders as
pipelined operations (row (b) of Table 4). Multiplexing so many
multipliers in a manual RTL layout could be very difficult and time
consuming. Moreover, for a changed specification such as the chip rate
or clock rate, we can rapidly reschedule the design to meet the
real-time requirement using the minimum hardware resources. A similar
design method is applied to the FIR and channel estimation modules.
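
The functional-unit counts in row (b) of Table 4 can be reproduced with a small calculation. The sketch below is only an illustration, not the scheduler used by Catapult C: it assumes, as read off Table 4, that the first cycle of the period performs no arithmetic and that the adders trail the multipliers by one cycle, leaving 8 active cycles for each unit type. The function names are hypothetical.

```python
def units_for_budget(total_ops, active_cycles):
    """Smallest number of reused units that finishes total_ops in active_cycles."""
    return -(-total_ops // active_cycles)  # integer ceiling division


def pipelined_schedule(mults, adds, period):
    """Per-cycle unit usage for a pipelined multiply-then-add schedule.

    Assumptions (taken from the shape of Table 4, row (b)):
    cycle 1 is idle, multipliers run in cycles 2..period-1,
    adders run one cycle later in cycles 3..period.
    """
    active = period - 2
    mu = units_for_budget(mults, active)
    ad = units_for_budget(adds, active)
    mu_row = [0] + [mu] * active + [0]
    ad_row = [0, 0] + [ad] * active
    return mu, ad, mu_row, ad_row


mu, ad, mu_row, ad_row = pipelined_schedule(mults=176, adds=136, period=10)
print("multipliers:", mu, "usage per cycle:", mu_row)
print("adders:     ", ad, "usage per cycle:", ad_row)
```

With 176 multiplications and 136 additions in a 10-cycle period, this gives 22 multipliers and 17 adders, matching the MU(b) and AD(b) rows of Table 4. Changing `period` shows how the unit count could be rescheduled for a different chip rate or clock rate.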

6.2.2. Block-based MIMO-FFT IP cores

For the multiple FFTs in the tap solver, the keys to optimizing the
area/speed are loop unrolling, pipelining, and resource multiplexing.
Although Xilinx provides FFT IP cores, they are considerably large and
much faster than required. For example, a single v32FFT core in the
Xilinx CoreGen library utilizes 12 multipliers and 2066 slices.
Moreover, it is not easy to exploit the commonality among the MIMO-FFTs
when using the IP core. To achieve the best area/time tradeoff in
different situations, we design customized MIMO-FFT modules that
utilize the commonality in control logic and phase coefficient loading.
Parallelism/pipelining in the parallel FFTs is studied extensively at
multiple levels, for example, the BFU level, the stage level, and the
FFT-processor level. The RTL designs scheduled by Catapult C for
32-point FFTs with 16-bit word length are compared with the Xilinx
v32FFT core in Table 5 for a single FFT. The Catapult C designs are
much smaller across the different solutions, for example, from
solution 1 with 8 multipliers and 535 slices to solution 3 with only one multiplier and
551 slices. Overall, solution 3 represents the smallest design with
slower but acceptable speed for a single FFT. For the MIMO-FFT/IFFT
modules, we can reuse the control logic inside the FFT module and
schedule the number of FUs more efficiently in the merged mode.

Table 6: The area/time specification of the major FPGA design cores.

Architecture            Latency     CLB       ASICMult
Correlator              1 chip      22399     80
16-FFT32                43.1 μs     2530      4
32 MatInvMult(4 × 4)    37.6 μs     4526      6
16-IFFT32               43.1 μs     2530      4
Overall tap solver      123.8 μs    7109. 3   14
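
To make the idea of resource multiplexing in the FFT modules of Section 6.2.2 concrete, the sketch below runs a 32-point radix-2 FFT through a single software "butterfly unit" that is reused for every stage. It is a floating-point model with hypothetical function names, not the fixed-point Catapult C design, and it says nothing about the cycle counts in Table 5; it only shows that one butterfly resource can serve all (N/2)·log2(N) butterfly operations.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_single_bfu(x):
    """Iterative radix-2 decimation-in-time FFT using one reused butterfly.

    Returns the transform and the number of butterfly operations executed.
    The input length must be a power of two.
    """
    n = len(x)
    stages = n.bit_length() - 1
    assert 1 << stages == n, "length must be a power of two"
    data = [x[bit_reverse(i, stages)] for i in range(n)]
    butterflies = 0
    for s in range(1, stages + 1):
        m = 1 << s          # butterfly group size at this stage
        half = m >> 1
        for start in range(0, n, m):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / m)  # twiddle (phase) factor
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b
                data[start + k + half] = a - b
                butterflies += 1
    return data, butterflies


samples = [complex(i % 5, -(i % 3)) for i in range(32)]
spectrum, count = fft_single_bfu(samples)
print("butterfly operations for a 32-point FFT:", count)
```

For N = 32 the single butterfly is invoked 16 × 5 = 80 times. Unrolling the inner loops over several butterfly units trades area for cycles, which is the kind of tradeoff the three Catapult C solutions explore.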

6.3. Prototyping implementation

Based on the above algorithmic and architectural optimizations, we have
prototyped the VLSI architecture of a (4 × 4) MIMO equalizer on the
Aptix FPGA platform [27]. The correlation window is set to 10 chips for
all 4 receive antennas. Fixed-point simulation shows that 8-bit input
chip samples incur negligible performance loss. To give a safe margin,
the input chip samples to both the correlator and the channel estimator
have 10-bit precision. The 32-point MIMO-FFT module has a 16-bit input
word length for both the covariance and channel coefficients. To support
even faster fading speed, we design the prototyping system for up to 4
updates per slot with an overall tap-solving latency requirement of 125
microseconds. In Table 6, we give the specification of the major design
blocks. Overall, we utilize only 4 multipliers to achieve an
area/time-efficient design for 16 merged FFT/IFFT modules. For the L_F
inverses of the (4 × 4) Hermitian submatrices, the latency is 38
microseconds with 6 multipliers. The different modules also have very
similar latency, which provides a well-balanced pipeline across multiple
stages. The overall latency of 124 microseconds meets the real-time
requirement very closely and gives area efficiency. This efficiency
benefits not only from the aforementioned algorithmic and architectural
optimizations, but also from the extensive design space exploration to
find the most compact design that meets the real-time requirement. The
integration of the MIMO equalizer into the complete HSDPA transceiver
system, following the same methodology as in [13], is also being
considered.
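
The latency and multiplier totals quoted for the tap solver can be checked directly against the per-stage entries of Table 6. The short sketch below does that arithmetic; the dictionary layout and names are hypothetical, and the numbers are copied from Table 6 and the 125-microsecond requirement stated above.

```python
# Per-stage (latency in microseconds, ASIC multipliers), copied from Table 6.
TAP_SOLVER_STAGES = {
    "16-FFT32": (43.1, 4),
    "32 MatInvMult(4x4)": (37.6, 6),
    "16-IFFT32": (43.1, 4),
}
LATENCY_BUDGET_US = 125.0  # overall tap-solving latency requirement


def tap_solver_totals(stages):
    """Sum latency and multiplier count over sequential tap-solver stages."""
    latency = sum(lat for lat, _ in stages.values())
    mults = sum(m for _, m in stages.values())
    return latency, mults


total_latency, total_mults = tap_solver_totals(TAP_SOLVER_STAGES)
slack = LATENCY_BUDGET_US - total_latency
print(f"latency {total_latency:.1f} us, multipliers {total_mults}, slack {slack:.1f} us")
```

The stage latencies add up to the 123.8 μs listed for the overall tap solver, which leaves a little over one microsecond of slack against the 125 μs requirement.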


