An Efficient Circulant MIMO Equalizer for CDMA Downlink: Algorithm and VLSI Architecture





EURASIP Journal on Applied Signal Processing

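Table 4: Architecture tradeoff exploration for covariance estimation module.

Cycles     1     2     3    4-8     9    10
MU(a)      0   176     0     0      0     0
AD(a)      0     0   136     0      0     0
MU(b)      0    22    22    22     22     0
AD(b)      0     0    17    17     17    17

MU and AD give the number of multipliers and adders active in each cycle of the 10-cycle computation period: (a) is the fully parallel design, (b) the resource-shared pipelined design.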

Table 5: Architecture efficiency comparison for Catapult C versus Xilinx IP core.

Architecture        Multipliers   Cycles   Slices
Xilinx core              12         128     2066
Catapult C Sol1           8         570      535
Catapult C Sol2           2         625      543
Catapult C Sol3           1         810      551

(ACR). After the word length is adjusted by shifting, a separate parallel-read-shuffle (PRS) module designed by Catapult C reads the registers in parallel for $[E_0, \ldots, E_L]$, writes them to memory, and shuffles the Hermitian part $[E_L^H, \ldots, E_1^H]$. Memory stalls are avoided, and scalability is achieved because the module can stop at any chip to adjust to different update rates.
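A minimal software model of the PRS output ordering may help. It assumes that the correlator registers hold the lag blocks $E_0, \ldots, E_L$ of a covariance sequence, so that the negative-lag blocks follow from Hermitian symmetry, $E_{-k} = E_k^H$, and that the shuffled Hermitian part is placed in front of $[E_0, \ldots, E_L]$. This layout and the helper names `hermitian` and `prs_shuffle` are hypothetical and chosen for illustration; they are not the memory map of the actual module.

```python
from typing import List

Matrix = List[List[complex]]  # small dense block, e.g. (2 x 2) for two antennas


def hermitian(m: Matrix) -> Matrix:
    """Conjugate transpose of a block stored as nested lists."""
    rows, cols = len(m), len(m[0])
    return [[m[r][c].conjugate() for r in range(rows)] for c in range(cols)]


def prs_shuffle(lags: List[Matrix]) -> List[Matrix]:
    """Model of the parallel-read-shuffle output ordering (assumed layout).

    `lags` holds [E_0, ..., E_L] as read in parallel from the correlator registers.
    The negative-lag blocks are not stored; they are produced by Hermitian symmetry,
    E_{-k} = E_k^H, and placed in front as [E_L^H, ..., E_1^H].
    """
    negative = [hermitian(e) for e in reversed(lags[1:])]
    return negative + list(lags)


if __name__ == "__main__":
    e0 = [[2 + 0j, 1j], [-1j, 2 + 0j]]
    e1 = [[1 + 1j, 0.5 + 0j], [0j, 1 - 1j]]
    row = prs_shuffle([e0, e1])  # [E_1^H, E_0, E_1]
    print(len(row))  # 3
```

Only the $L + 1$ non-negative lags need to be accumulated; the remaining $L$ blocks cost conjugations and a transposed read, which is why the shuffle can be done on the fly rather than by extra multiply-accumulate hardware.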

In the PMS, the number of FUs is assigned according to the time/area constraints. As an example, for a (2 × 2) case with L = 10, the VLSI area/time tradeoff is shown in Table 4. The complexity is 176 multiplications and 136 additions in each computation period. A typical manual design would lay out 176 multipliers and 136 adders all in parallel. This takes 4 cycles to complete the computation; however, the multipliers sit idle for 9 cycles and are wasted. At the other extreme, an area-constrained solution reuses one multiplier and one adder but has to take more than 176 cycles. The most area/time-efficient architecture within 10 cycles reuses 22 multipliers and 17 adders as pipelined operations, each stage being active for 8 cycles (22 × 8 = 176 multiplications and 17 × 8 = 136 additions). Multiplexing so many multipliers in a manual RTL layout could be very difficult and time consuming. Moreover, for a changed specification such as the chip rate or clock rate, we can rapidly reschedule the design to meet the real-time requirement with the minimum hardware resources. A similar design method is applied to the FIR filtering and the channel estimation.
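The resource-shared column of Table 4 can be reproduced with a small calculation. With a 10-cycle budget, one cycle before the first multiplication and a one-cycle lag between the multiplier and adder stages leave 8 issue cycles per stage, so ⌈176/8⌉ = 22 multipliers and ⌈136/8⌉ = 17 adders suffice. The sketch below assumes exactly that pipeline shape (operand read in cycle 1, multiplies in cycles 2 to 9, adds in cycles 3 to 10), which is inferred from Table 4 rather than taken from the Catapult C scheduler; `min_units` and `schedule` are hypothetical names, and dependencies between individual products and sums are not modeled.

```python
import math
from typing import List, Tuple


def min_units(ops: int, slots: int) -> int:
    """Fewest functional units that can issue `ops` operations within `slots` cycles."""
    return math.ceil(ops / slots)


def schedule(mults: int, adds: int, budget: int) -> Tuple[int, int, List[int], List[int]]:
    """Resource-shared schedule under an assumed pipeline shape:
    cycle 1 reads operands, multipliers issue in cycles 2..budget-1,
    and adders trail the multipliers by one cycle (cycles 3..budget)."""
    slots = budget - 2  # issue cycles available to each stage
    n_mu, n_ad = min_units(mults, slots), min_units(adds, slots)
    mu, ad = [0] * budget, [0] * budget  # active units per cycle (index 0 = cycle 1)
    left = mults
    for c in range(1, budget - 1):  # cycles 2..budget-1
        mu[c] = min(n_mu, left)
        left -= mu[c]
    left = adds
    for c in range(2, budget):  # cycles 3..budget
        ad[c] = min(n_ad, left)
        left -= ad[c]
    return n_mu, n_ad, mu, ad


if __name__ == "__main__":
    n_mu, n_ad, mu, ad = schedule(176, 136, 10)
    print(n_mu, n_ad)  # 22 17
    print(mu)
    print(ad)
```

Calling `schedule(176, 136, 178)` gives the area-constrained extreme of one multiplier and one adder; sweeping the budget between these points traces out the area/time curve that the scheduler explores.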

6.2.2. Block-based MIMO-FFT IP cores

For the multiple FFTs in the tap solver, the keys to optimizing area/speed are loop unrolling, pipelining, and resource multiplexing. Although Xilinx provides FFT IP cores, they are considerably larger and much faster than required. For example, a single v32FFT core in the Xilinx CoreGen library utilizes 12 multipliers and 2066 slices. Moreover, it is not easy to exploit the commonality among the MIMO-FFTs when using the IP core. To achieve the best area/time tradeoff in different situations, we design customized MIMO-FFT modules that exploit the commonality in control logic and phase coefficient loading. Parallelism and pipelining in the parallel FFTs are studied extensively at multiple levels, for example, the BFU level, the stage level, and the FFT-processor level. Catapult C scheduled RTL designs for 32-point, 16-bit FFTs are compared with the Xilinx v32FFT core in Table 5 for a single FFT.
All Catapult C solutions are much smaller than the Xilinx core, ranging from solution 1 with 8 multipliers and 535 slices to solution 3 with only one multiplier and 551 slices. Overall, solution 3, with a single multiplier, represents the smallest design, at a slower but acceptable speed for a single FFT. For the MIMO-FFT/IFFT modules, we can reuse the control logic inside the FFT module and schedule the number of FUs more efficiently in the merged mode.

Table 6: The area/time specification of the major FPGA design cores.

Architecture              Latency     CLB      ASIC Mult
Correlator                1 chip      22399    80
16-FFT32                  43.1 μs     2530     4
32 MatInvMult (4 × 4)     37.6 μs     4526     6
16-IFFT32                 43.1 μs     2530     4
Overall tap solver        123.8 μs    7109.3   14
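As a software illustration of the BFU, stage, and FFT-processor levels discussed for the MIMO-FFT modules, the sketch below implements a textbook iterative radix-2 decimation-in-time FFT in which every stage reuses one butterfly routine and several FFTs share a single precomputed twiddle table, loosely mirroring the shared control logic and phase coefficient loading of the merged modules. It is a floating-point model written for clarity, not the 16-bit Catapult C design, and the function names are hypothetical.

```python
import cmath
from typing import List, Sequence, Tuple


def twiddles(n: int) -> List[complex]:
    """Phase coefficients W_n^k = exp(-2*pi*j*k/n) for k < n/2, computed once and shared."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of i (input reordering for decimation in time)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x: Sequence[complex], w: List[complex]) -> Tuple[List[complex], int]:
    """Iterative radix-2 DIT FFT; returns the spectrum and the number of butterflies."""
    n = len(x)
    bits = n.bit_length() - 1
    if (1 << bits) != n or len(w) != n // 2:
        raise ValueError("length must be a power of two matching the twiddle table")
    a = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    butterflies = 0
    size = 2
    while size <= n:  # one pass per stage: log2(n) stages
        half, step = size // 2, n // size
        for start in range(0, n, size):
            for k in range(half):  # one butterfly unit (BFU) operation
                t = w[k * step] * a[start + k + half]
                u = a[start + k]
                a[start + k], a[start + k + half] = u + t, u - t
                butterflies += 1
        size *= 2
    return a, butterflies


def mimo_fft(blocks: Sequence[Sequence[complex]], n: int = 32) -> List[List[complex]]:
    """Run several n-point FFTs with one shared twiddle table and one shared control loop."""
    w = twiddles(n)
    return [fft_radix2(b, w)[0] for b in blocks]


if __name__ == "__main__":
    tone = [cmath.exp(2j * cmath.pi * 3 * t / 32) for t in range(32)]
    spectra = mimo_fft([tone, [1.0] + [0.0] * 31])  # two FFTs, one twiddle table
    print(round(abs(spectra[0][3])))  # 32
```

For a 32-point transform the loop performs 5 stages of 16 butterflies, or 80 butterfly operations per FFT; in hardware, the number of BFUs instantiated determines how many of these are issued per cycle, which is the knob traded off in Table 5.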

6.3. Prototyping implementation

Based on the above algorithmic and architectural optimizations, we have prototyped the VLSI architecture of a (4 × 4) MIMO equalizer on the Aptix FPGA platform [27]. The correlation window is set to 10 chips for all 4 receive antennas. Fixed-point simulation shows that 8-bit input chip samples incur negligible performance loss. To provide a safe margin, the input chip samples to both the correlator and the channel estimator have 10-bit precision. The 32-point MIMO-FFT module has a 16-bit input word length for both the covariance and the channel coefficients. To support even faster fading, we design the prototyping system for up to 4 updates per slot, with an overall tap-solving latency requirement of 125 microseconds. Table 6 gives the specification of the major design blocks. Overall, we use only 4 multipliers to achieve an area/time-efficient design for the 16 merged FFT/IFFT modules. For the LF inverse of the (4 × 4) Hermitian submatrices, the latency is 38 microseconds with 6 multipliers. Note also that the different modules have very similar latencies, which yields well-balanced pipelining across the stages. The overall latency of 124 microseconds (43.1 + 37.6 + 43.1 = 123.8 μs in Table 6) meets the real-time requirement very closely, which reflects the area efficiency of the design. This efficiency benefits not only from the aforementioned algorithmic and architectural optimizations, but also from extensive design space exploration to find the most compact design that meets the real-time requirement. Integration of the MIMO equalizer into the complete HSDPA transceiver system, following the same methodology as in [13], is also being considered.
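The latency budget can be checked with a few lines of arithmetic: the three tap-solver stages in Table 6 sum to 123.8 μs, leaving about 1.2 μs of slack against the 125 μs requirement. If the stages are pipelined so that successive tap-solver runs overlap (an assumption about how the balanced stages are used), a new run could start every 43.1 μs, the latency of the slowest stage. The names `STAGES_US`, `REQUIREMENT_US`, and `latency_report` are illustrative only.

```python
from typing import Dict, Tuple

# Tap-solver stage latencies from Table 6, in microseconds.
STAGES_US: Dict[str, float] = {
    "16-FFT32": 43.1,
    "32 MatInvMult (4x4)": 37.6,
    "16-IFFT32": 43.1,
}
REQUIREMENT_US = 125.0  # stated tap-solving latency requirement (up to 4 updates per slot)


def latency_report(stages: Dict[str, float], requirement: float) -> Tuple[float, float, float]:
    """Return (end-to-end latency, slowest-stage latency, slack against the requirement)."""
    total = sum(stages.values())  # one tap-solver run passes through every stage
    bottleneck = max(stages.values())  # would set the initiation interval if stages overlap
    return total, bottleneck, requirement - total


if __name__ == "__main__":
    total, bottleneck, slack = latency_report(STAGES_US, REQUIREMENT_US)
    print(f"total {total:.1f} us, bottleneck {bottleneck:.1f} us, slack {slack:.1f} us")
```

Because the slowest stage is only about 15% longer than the fastest one (43.1 μs versus 37.6 μs), little pipeline time is lost waiting on an unbalanced stage.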


