An Efficient Circulant MIMO Equalizer for CDMA Downlink: Algorithm and VLSI Architecture



Yuanbin Guo et al.

saves roughly 50% of the real multiplications because the FFT
length within 64 points will su
ffice for most realistic equalizer
applications.

4.3. Hermitian matrix inverse architectures

In this section, we utilize the Hermitian feature and focus on
the optimization of the submatrix inverse and multiplication
module following the element-wise FFT modules in the tap
solver. Although the FFT-based tap solver avoids the direct
matrix inverse of the original covariance matrix with the di-
mension of (
NF × NF), the inverse of the diagonal matrix
F is inevitable. For a MIMO receiver with high receive di-
mension, the matrix inverse and multiplication in
F- 1hm is
not trivial. Because
F is a block-diagonal matrix, its inverse
can be decomposed to the inverse of
LF submatrices of size
(
N × N)asin

F -1 = diag (f[0] -1, F[1] -1,..., F[ Lf - 1]- 1).      (15)

A traditional (N×N) matrix inverse using Gaussian elim-
ination has the complexity at
O(N3) complex operations.
Cholesky decomposition can be applied to facilitate the in-
verse of these matrices. However, this method requires arith-
metic square root operation, which is expensive for hardware
implementation. Considering the fact that it is unlikely to
have more than four Rx antennas in a mobile terminal, we
consider the two special cases individually, that is, 2 and 4 Rx
antennas. We propose complexity-reduction schemes and ef-
ficient architectures suitable for VLSI implementation based
on the exploration of block partitioning. The commonality
of the partitioned block matrix inverse is extracted to design
generic RTL modules for reusable modularity. We then build
the (4
× 4) receiver by reusing the (2 × 2) block partitioning.

4.3.1. Dual-antenna MIMO receiver

From (11), a straightforward partitioning is at the matrix
inversion of
F and then the matrix multiplication of the
dimension-wise FFT of the channel coe
fficients as F- 1(D ®
I)hm. In this partitioning, we would first compute the inverse
of the entire subblock matrix in
F and then carry out a ma-
trix multiplication. However, this partitioning involves two
separate loop structures. In the VLSI circuit design, this will
introduce some overhead for memory access and finite-state
machine logic. Since the two steps have the same loop struc-
ture, it is more desirable to merge the two steps and reduce
the overhead shown as follows. The inverse of a (2
× 2) sub-
matrix is given by

F[k]-1 =   fOO(k) fO1(k)    1

f1O(k) f11(k)

=_______________1_______________( fii( k )   - f)1( k ))

foo(k) fn(k) - f)1(k) flo(k) ∖-flo(k) foo(k) ) .

(16)

Let Γ = (D ® I)h m = [Γ[0], Γ[1],..., Γ[ Lf - 1]], where
Γ[k] = [e1 (k) e2(k)] is the combination of the kth elements
of the dimension-wise FFT coe
fficients, then a merged com-
putation of the matrix inverse and multiplication is given by

W = F-1 (D ® I)hm

= diag (F[0]-1, F[1]-1,..., F[ Lf - 1]- 1)Γ

= [f[0] - 1Γ[0] T, F[1]- 1Γ[1] T,..., F[ Lf -1]- 1Γ[ Lf -1] T ].

(17)

W(k) =


Thus, we can use a single merged loop to compute the
final result of
W instead of using separate loops. Moreover,
with the
Hermitian features of F00 and F11, we can reduce the
number of real operations in the matrix inverse and multi-
plication module. This leads to a simplified equation for the
kth element of the matrix W as
f)0(k) fL1(k) - f)1(k) I

( fL1(k) O e 1(k) - f)1(k) * e2(k) ∖
■ ^-flo(k) о e2(k) - f)i(k) * eι(k)J,

(18)


where “a b,” “a O b,” and “a * b” indicate a “real × real,”
“real
× complex” and “complex × complex” multiplication,
respectively. The complex division is replaced by a real divi-
sion. From this, we derived the simplified data path with the
Hermitian optimization as in
Figure 4. In this figure, f00(k)
and
f11(k) are real numbers. The single multiplier means a
real multiplication. The multiplier with a circle means the
“real
× complex” multiplication and the multiplier with a
rectangle is a “complex
× complex” multiplication. The sim-
plified data path facilitates the scaling, and thus increases the
stability in the fixed-point implementation.

4.3.2. Receiver with 4 Rx antennas

The principle operation of interest is the inverse of the (4 × 4)
submatrices. To achieve a scalable design, we first partition
the (4
× 4) submatrices in F[i] into four (2 × 2) block sub
matrices as

f11(i)

f12(i)

f13(i)

f14(i)

F( i )4 × 4 =

f21(i)

f22 (i)

f23 (i)

f24(i)

f31(i)

f32 (i)

f33(i)

f34(i)

(19)

f41 (i)

f42(i)

f43(i)

f44(i)

=

B11(i)

B21(i)

B12(i)

B22(i)

).

The inverse of the (4 × 4) matrix can be carried out by a se-
quential inverse of four (2
× 2) submatrices. We also partition
the inverse of the (4
× 4) element matrix as

F(i)-1 = C11(i) C12(i)

F(i)    =   C21(i) C22(i) .

(20)




More intriguing information

1. The name is absent
2. Equity Markets and Economic Development: What Do We Know
3. The name is absent
4. Auctions in an outcome-based payment scheme to reward ecological services in agriculture – Conception, implementation and results
5. The name is absent
6. The name is absent
7. The name is absent
8. Heterogeneity of Investors and Asset Pricing in a Risk-Value World
9. The name is absent
10. Towards Learning Affective Body Gesture