EURASIP Journal on Applied Signal Processing
The direct inverse of the matrix is very expensive for hard-
ware implementation.
To reduce the computation complexity, an FFT-based fast
algorithm is presented in this section. It is known that a cir-
culant matrix S can be diagonalized by the FFT operation
as S = DHΛD, where D is the FFT phase coefficient ma-
trix and Λ is a diagonal matrix whose diagonal elements are
the FFT result of the first column of the circulant matrix S.
This known lemma is applied to simplify the MIMO equal-
izer computation dramatically. It is shown that the covari-
ance matrix Rrr can be approximated by a block-circulant
matrix after we add two corner matrices as
0
'rr R rr + .
CL
0CLH
.
.. 0
00
(7)
where
/Eh [ L ] 0 0 ∖
CL = . ... 0
∖Eh [1] ■■■ Eh [ L ])
(8)
Using the extension of the diagonalization lemma and the
features of Kronecker product, the block-circulant matrix can
be decomposed as
Crr = (Dh 0 I)( ɪ Wi 0 E[iɔ (D ® I),
(9)
where W = diag(1, W-1 ■ ■ ■ W- LF 1)) and WLF = e(2n/LF)
is the phase factor coefficient for the DFT computation.
denotes the Kronecker product. By denoting
F=
( ɪ Wi 0 E[if),
(10)
it can be shown that the final MIMO equalizer taps are com-
puted as the following equation:
wopt ≈ (Dh 0 I) ■ F-1 ■ (D 0 I)hm. (11)
F = diag(F[0], F[1], ..., F[LF]) is a block-diagonal matrix
with elements taken from the element-wise FFT of the first
column of the block-circular matrix Crr. For an (M × N)
MIMO system, this reduces the inverse of an (NLF × NLF)
matrix to the inverse of subblock matrices with size (N × N).
3.3. System-on-chip (SoC) architecture partitioning
To achieve the real-time implementation, either DSP pro-
cessors or VLSI architectures could be applied. For exam-
ple, a multiple-processor architecture using TI’s DSP proces-
sors has been reported in [21] for the 3G base station im-
plementation. However, the requirement for low power con-
sumption and compact size makes it difficult to use multiple
DSPs in a mobile handset to achieve the real-time processing
power for the chip-level physical layer design, especially for
the MIMO systems. SoC architecture is a major revolution
for integrated circuits due to the unprecedented levels of
integration and many advantages on the power consump-
tion and compact size. However, the straightforward imple-
mentation of the proposed equalizer has many redundancies
in computation. Many optimizations are needed to make it
more suitable for real-time implementation. We emphasize
the interaction between architecture, system partitioning,
and pipelining in this section with these objectives: (1) pro-
pose VLSI-oriented optimizations to further reduce the com-
putation complexity; (2) implement the equalizer with the
minimum hardware resource to meet the real-time require-
ment; (3) obtain an efficient architecture with optimal par-
allelism and pipelining for the critical computation parts. To
explore the efficient architectures, we elaborate the tasks as
the following procedure.
(1) Compute the independent correlation elements [E[0],
... , E[L]], and form the first block column of circu-
lant Cr(1r) by adding the corner elements as Cr(1r) =
[E[0],...,E[L],0,...,0,EH [L],...,EH [1]]T. Each ele-
ment is an (N × N)subblockmatrix.
(2) Take the element-wise FFT of C(r1r), where the element
vectors Fn1,n2 = FFT{E(nc1),n2} and E(nc1),n2 (i) = Cr(1r)[(n1 -
i - 1) * N + n2 - 1] for i ∈ [0,LF], n 1,n2 ∈ [1,N].
(3) For m ∈ [1, M ]andn ∈ [1, N ], compute the
dimension-wise FFT of the channel estimation as
Φm = (D 0 I)hm = FFT([0, ... ,0, hm,n(L),... , hm,n(0),
0,...,0]).
(4) Compute the inverse of the block-diagonal matrix F,
where F-1 = diag(F[0]-1, ..., F[LF - 1]-1) and F[i] is
an (N × N) submatrix formed from the ith subcarrier
of Fn1,n2.
(5) Compute the matrix multiplication of the submatrices
inverse with the FFT output of channel estimation co-
efficients Ψm = F-1 Φm .
(6) Compute the dimension-wise IFFT of the multiplica-
tion results wopt ≈ (Dh 0 I)Ψm.
With a timing- and data-dependency analysis, the top-
level block diagram for the MIMO equalizer is shown in
Figure 2. The system-level pipeline is designed for bet-
ter modularity. In the front end, a correlation estimation
block takes the multiple-input samples for each chip to
compute the covariance coefficients of the first column of
Rrr. It is made circulant by adding corner to form the
matrix [E[0], ... , E[L], 0, ... , 0, E[L]H , ... , E[1]H ]. The com-
plete coefficients are written to DPRAMs and the (N × N)
element-wise FFT module computes [F[0], ..., F[LF ]] =
FFT{E[0],...,E[L],0,...,0,E[L]H ,...,E[1]H }.
Another parallel data path is for the channel estimation
and the (M × N) dimension-wise FFTs on the channel coef-
ficient vectors as in (D 0 I)hm. A submatrix inverse and mul-
tiplication block takes the FFT coefficients of both the chan-
nel estimation and correlation estimation coefficients from
DPRAMs and carries out the computation as in F-1Φm. Fi-
nally, an (M × N) dimension-wise IFFT module generates
the results for the equalizer taps wopt and sends them to the