EURASIP Journal on Applied Signal Processing
In this paper, we first present an FFT-based fast algorithm
for solving the equalizer taps. The block Toeplitz structure
of the covariance matrix is approximated by a block-circulant
matrix, which avoids the direct matrix inverse. The inverse of
the large covariance matrix is reduced to a set of parallel
FFT/IFFT operations and the inverses of much smaller
submatrices. This algorithm reduces the complexity order to
$O(NF \log_2 F)$, which makes real-time implementation much
easier. An
algorithmic-level comparative study for different equaliz-
ers demonstrates their promising performance/complexity
tradeoff.
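The reduction of a block-circulant inverse to FFT/IFFT operations plus small per-frequency solves can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this paper: it starts from a matrix that is already block-circulant (the Toeplitz-to-circulant approximation step is not shown), it fixes the block size at N = 2 so that a closed-form solve can be used, and it requires F to be a power of two. The function names `fft`, `solve_2x2`, `block_circulant_solve`, and `block_circulant_matvec` are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def solve_2x2(a, y):
    """Solve the 2x2 system a x = y in closed form."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * y[0] - a[0][1] * y[1]) / det,
            (a[0][0] * y[1] - a[1][0] * y[0]) / det]


def block_circulant_solve(blocks, rhs):
    """Solve C w = rhs, where block (p, q) of C is blocks[(p - q) % F].

    blocks: F matrices of size 2x2 (assumption: N = 2).
    rhs:    F vectors of length 2.
    """
    F = len(blocks)
    # FFT of every block entry and every rhs entry along the block index.
    b_hat = [[fft([blocks[k][r][c] for k in range(F)]) for c in range(2)]
             for r in range(2)]
    y_hat = [fft([rhs[p][r] for p in range(F)]) for r in range(2)]
    # One small N x N solve per frequency bin replaces the NF x NF inverse.
    w_hat = [solve_2x2([[b_hat[r][c][f] for c in range(2)] for r in range(2)],
                       [y_hat[0][f], y_hat[1][f]])
             for f in range(F)]
    # IFFT back to the tap domain (scaled by 1/F).
    cols = [fft([w_hat[f][r] for f in range(F)], inverse=True)
            for r in range(2)]
    return [[cols[r][p] / F for r in range(2)] for p in range(F)]


def block_circulant_matvec(blocks, w):
    """Direct product C w, used only to check the fast solver."""
    F = len(blocks)
    out = []
    for p in range(F):
        row = [0j, 0j]
        for q in range(F):
            b = blocks[(p - q) % F]
            for r in range(2):
                row[r] += b[r][0] * w[q][0] + b[r][1] * w[q][1]
        out.append(row)
    return out
```

The FFTs over the block entries are independent of each other, which is where the parallelism mentioned above comes from; only the per-bin solves grow with the number of antennas.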
As far as real-time implementation is concerned, a system-on-chip
(SoC) architecture offers more parallelism, a more compact size,
and lower power consumption than general-purpose DSP processors.
However, research on SoC architectures for the MIMO-HSDPA mobile
receiver remains a relatively new and active topic. Recently,
Nokia successfully demonstrated a single-antenna HSDPA real-time
system at the CTIA’03 wireless trade show [13, 14]. Although
MIMO-VLSI implementations have been reported for Lu-
cent’s BLAST ASIC chip [15] and some MIMO detection
algorithms [16], the VLSI architecture design of MIMO-
CDMA equalizers remains a new research topic. To support
the MIMO-CDMA downlink in a multipath fading channel,
it is necessary to explore the efficient VLSI design architec-
ture [17] for the complex equalizer.
In the second part, we focus on the VLSI-oriented op-
timizations of the architecture complexity. Hermitian opti-
mization is proposed by utilizing the structures of the cor-
relation coefficients and the FFT algorithm. A reduced-state
FFT module is proposed to avoid redundant computation
of the symmetric coefficients and the zero coefficients. This
reduces both the number and complexity of the conven-
tional FFT module. On the other hand, the matrix inverse
of some smaller submatrices of size (N × N) is inevitable
for the MIMO receiver although the (NF × NF) inverse
is avoided. For a high-order MIMO receiver, the complex-
ity still increases dramatically with the number of antennas.
Therefore, the Hermitian structure is applied to reduce the
submatrix inverse complexity. Of particular interest is the
nontrivial (4 × 4) MIMO configuration. We apply a divide-and-conquer
method to partition each (4 × 4) submatrix into four
(2 × 2) submatrices. The (4 × 4) matrix inverse is then
dramatically simplified by exploiting the commonality in a
partitioned matrix inverse lemma. Generic VLSI architectures are
derived from the special design blocks to eliminate the re-
dundancies in the complex operations. The regulated model
facilitates the design of efficient parallel VLSI modules such
as “complex-Hermitian-multiplication,” “Hermitian inverse”
and “diagonal transform.” This leads to efficient architectures
with a further 3× complexity reduction and a more parallel and
pipelined structure.
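The divide-and-conquer inverse of a (4 × 4) Hermitian matrix can be illustrated with the standard Schur-complement form of the partitioned matrix inverse. This is a sketch under assumptions: the exact form of the lemma used in this paper is not given in this section, and the helper names (`mul`, `add`, `herm`, `inv2`, `hermitian_inverse_4x4`) are hypothetical. The shared term `T` and the Hermitian relation between the off-diagonal blocks show the kind of commonality that can be reused.

```python
def mul(a, b):
    """Product of two 2x2 matrices."""
    return [[a[r][0] * b[0][c] + a[r][1] * b[1][c] for c in range(2)]
            for r in range(2)]


def add(a, b, sign=1):
    """a + sign * b for 2x2 matrices."""
    return [[a[r][c] + sign * b[r][c] for c in range(2)] for r in range(2)]


def herm(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[c][r].conjugate() for c in range(2)] for r in range(2)]


def inv2(a):
    """Closed-form inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def hermitian_inverse_4x4(A):
    """Invert a 4x4 Hermitian matrix through its four 2x2 blocks.

    Assumes A11 and the Schur complement S are invertible, which holds
    for a positive-definite A.
    """
    A11 = [row[:2] for row in A[:2]]
    A12 = [row[2:] for row in A[:2]]
    A22 = [row[2:] for row in A[2:]]
    A11_inv = inv2(A11)
    T = mul(A11_inv, A12)                # shared term A11^{-1} A12
    S = add(A22, mul(herm(A12), T), -1)  # Schur complement A22 - A12^H T
    S_inv = inv2(S)
    C12 = [[-v for v in row] for row in mul(T, S_inv)]
    C11 = add(A11_inv, mul(C12, herm(T)), -1)
    C21 = herm(C12)                      # Hermitian symmetry: no extra products
    return [C11[0] + C12[0], C11[1] + C12[1],
            C21[0] + S_inv[0], C21[1] + S_inv[1]]
```

Only two (2 × 2) inverses are needed, and the lower-left block follows from the upper-right block by conjugate transposition rather than by a separate computation.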
In addition to minimizing the circuit area used, the de-
sign needs to work within a time budget. There are many
area/time tradeoffs in the VLSI architectures. An extensive
architecture tradeoff study provides critical insights into
implementation issues that may arise during the product
development process. However, this type of SoC design space
exploration is extremely time consuming because the stan-
dard trial-and-optimize approaches today are usually tied to
hand-coded VHDL/Verilog-based methodology [18, 19]. In
this paper, we present a Catapult C-based [13] high-level-
synthesis (HLS) methodology which integrates several key
technologies to explore the VLSI architecture tradeoffs ex-
tensively. Extensive design space exploration is enabled by al-
locating different architecture/resource constraints in a Cat-
apult C architecture scheduler [13]. Synthesizable register-
transfer-level (RTL) design is generated from an algorithmic
C/C++ fixed-point design, integrated in other downstream
flows and validated in a Xilinx FPGA prototyping platform.
The rest of the paper is organized as follows. Section 2
gives the MIMO-CDMA downlink system model. The FFT-
based circulant chip equalizer is presented in Section 3.
Section 4 presents the system-level partitioning and the
VLSI-level complexity optimization. The comparative per-
formance and complexity analysis are presented in Section 5.
Finally, Section 6 presents the HLS-based design space explo-
ration and an experimental implementation on FPGA.
2. SYSTEM MODEL FOR MIMO-CDMA DOWNLINK
The system model of the MIMO multicode CDMA down-
link with M Tx antennas and N Rx antennas is described in
Figure 1. In a multicode CDMA downlink, multiple spreading
codes are assigned to a single user to achieve a high data
rate. By using spatial multiplexing, the high-data-rate symbols
are demultiplexed into KM lower-rate substreams, where K
is the number of spreading codes for data transmission. The
substreams are divided into M groups, where each substream
in the group is spread with a spreading code of spreading
gain G. Each group is then combined and scrambled with
long scrambling codes and transmitted through the mth Tx
antenna. The chip-level signal at the mth transmit antenna is
given by $d_m(i + jG) = \sum_{k=1}^{K} s_m^k(j)\, c_m^k(i) + s_m^P(j)\, c_m^P(i)$, where
j is the symbol index, i is the chip index, and k is the index of
the composite spreading code. $s_m^k(j)$ is the jth symbol of the
kth code at the mth substream. In the following, we focus on
the jth symbol and omit the symbol index for notation simplicity.
$c_m^k(i) = c_k(i)\, c_m^{(s)}(i)$ is the composite spreading code
sequence for the kth code at the mth substream, where $c_k(i)$
is the user-specific Hadamard code and $c_m^{(s)}(i)$ is the
antenna-specific scrambling long code. $s_m^P(j)$ denotes the pilot
symbols at the mth antenna. $c_m^P(i) = c^P(i)\, c_m^{(s)}(i)$ is the
composite spreading code for pilot symbols at the mth antenna. The
received chip-level signal at the nth Rx antenna is given by
$$r_n(i) = \sum_{m=1}^{M} \sum_{l=0}^{L_{m,n}} h_{m,n}(l)\, d_m(i - \tau_l) + z_n(i), \tag{1}$$
where $h_{m,n}(l)$ and $L_{m,n}$ are the lth path channel coefficient
and the delay spread between the mth Tx antenna and the
nth Rx antenna, respectively. $z_n(i)$ is the additive Gaussian
noise at the nth receive antenna.
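The transmit and receive models above can be sketched for a single symbol interval as follows. This is an illustrative sketch only: it assumes chip-spaced path delays ($\tau_l = l$ chips), treats chips before the start of the sequence as zero, and makes the noise term optional. The function names and the short real-valued scrambling sequence used with it are hypothetical, whereas an actual scrambling code is a long code.

```python
def composite_code(spreading, scrambling):
    """Chip-wise product c_k(i) * c_m^(s)(i) of a spreading and a scrambling code."""
    return [c * s for c, s in zip(spreading, scrambling)]


def transmit_chips(symbols, pilot, codes, pilot_code):
    """Chip sequence d_m(i + j*G) for one Tx antenna.

    symbols[k][j]: jth symbol on the kth code, pilot[j]: jth pilot symbol,
    codes[k][i] and pilot_code[i]: composite codes of length G.
    """
    G = len(pilot_code)
    chips = []
    for j in range(len(pilot)):
        for i in range(G):
            data = sum(symbols[k][j] * codes[k][i] for k in range(len(codes)))
            chips.append(data + pilot[j] * pilot_code[i])
    return chips


def receive_chips(d, h, noise=None):
    """Received chips r_n(i) at one Rx antenna, as in (1).

    d[m]: chip sequence of Tx antenna m, h[m][l]: lth path coefficient
    from Tx antenna m (assumption: tau_l = l chips), noise[i]: z_n(i).
    """
    r = []
    for i in range(len(d[0])):
        acc = 0j
        for m in range(len(d)):
            for l, tap in enumerate(h[m]):
                if i - l >= 0:  # chips before the start are taken as zero
                    acc += tap * d[m][i - l]
        if noise is not None:
            acc += noise[i]
        r.append(acc)
    return r
```

Because the Hadamard codes are orthogonal and the same scrambling sequence multiplies both the data and pilot codes of an antenna, correlating the transmitted chips with a composite code and dividing by G recovers the corresponding symbol in the absence of multipath.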
By packing the received chips from all the receive antennas
in a vector $\mathbf{r}(i) = [r_1(i), \ldots, r_n(i), \ldots, r_N(i)]^T$ and collecting
the $L_F = 2F + 1$ consecutive chips with center at the ith