saves roughly 50% of the real multiplications, because an FFT length within 64 points suffices for most realistic equalizer applications.
4.3. Hermitian matrix inverse architectures
In this section, we utilize the Hermitian feature and focus on
the optimization of the submatrix inverse and multiplication
module following the element-wise FFT modules in the tap
solver. Although the FFT-based tap solver avoids the direct
matrix inverse of the original covariance matrix with the di-
mension of (NF × NF), the inverse of the diagonal matrix
F is inevitable. For a MIMO receiver with a high receive dimension, the matrix inverse and multiplication in $F^{-1}h_m$ are not trivial. Because F is a block-diagonal matrix, its inverse can be decomposed into the inverses of $L_F$ submatrices of size (N × N), as in
\[
F^{-1} = \operatorname{diag}\bigl(F[0]^{-1}, F[1]^{-1}, \ldots, F[L_F-1]^{-1}\bigr).
\tag{15}
\]
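Equation (15) is what makes the tap solver tractable: the large block-diagonal inverse collapses into $L_F$ independent (N × N) inverses. The following is a minimal floating-point sketch of this decomposition; numpy is used purely for illustration, and the function name blockdiag_inverse and the example sizes are ours, not part of the paper's design.

```python
import numpy as np

def blockdiag_inverse(blocks):
    """Invert a block-diagonal matrix block by block, per (15)."""
    return [np.linalg.inv(Fk) for Fk in blocks]

# Illustration: L_F = 4 Hermitian (2 x 2) blocks (N = 2).
rng = np.random.default_rng(0)
blocks = []
for _ in range(4):
    A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
    blocks.append(A @ A.conj().T + 2 * np.eye(2))  # Hermitian and well conditioned

# The k-th diagonal block of the assembled F^{-1} equals F[k]^{-1},
# so the full (8 x 8) inverse never has to be formed.
F = np.zeros((8, 8), dtype=complex)
for k, Fk in enumerate(blocks):
    F[2*k:2*k+2, 2*k:2*k+2] = Fk
F_inv = np.linalg.inv(F)
for k, Fk_inv in enumerate(blockdiag_inverse(blocks)):
    assert np.allclose(F_inv[2*k:2*k+2, 2*k:2*k+2], Fk_inv)
```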
A traditional (N × N) matrix inverse using Gaussian elimination has a complexity of $O(N^3)$ complex operations.
Cholesky decomposition can be applied to facilitate the in-
verse of these matrices. However, this method requires arith-
metic square root operation, which is expensive for hardware
implementation. Considering that a mobile terminal is unlikely to have more than four Rx antennas, we treat the two special cases of 2 and 4 Rx antennas individually. We propose complexity-reduction schemes and efficient architectures suitable for VLSI implementation based
on the exploration of block partitioning. The commonality
of the partitioned block matrix inverse is extracted to design
generic RTL modules for reusable modularity. We then build
the (4 × 4) receiver by reusing the (2 × 2) block partitioning.
4.3.1. Dual-antenna MIMO receiver
From (11), a straightforward partitioning is to perform the matrix inversion of F and then the matrix multiplication with the dimension-wise FFT of the channel coefficients, as in $F^{-1}(D \otimes I)h_m$. In this partitioning, we would first compute the inverse
of the entire subblock matrix in F and then carry out a ma-
trix multiplication. However, this partitioning involves two
separate loop structures. In the VLSI circuit design, this will
introduce some overhead for memory access and finite-state
machine logic. Since the two steps have the same loop struc-
ture, it is more desirable to merge the two steps and reduce
the overhead shown as follows. The inverse of a (2 × 2) sub-
matrix is given by
\[
F[k]^{-1} =
\begin{pmatrix}
f_{00}(k) & f_{01}(k) \\
f_{10}(k) & f_{11}(k)
\end{pmatrix}^{-1}
= \frac{1}{f_{00}(k) f_{11}(k) - f_{01}(k) f_{10}(k)}
\begin{pmatrix}
f_{11}(k) & -f_{01}(k) \\
-f_{10}(k) & f_{00}(k)
\end{pmatrix}.
\tag{16}
\]
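As a quick sanity check of (16), the sketch below compares the adjugate-based (2 × 2) inverse with a general-purpose inverse in floating point; the helper name inv2x2 and the example block are ours.

```python
import numpy as np

def inv2x2(F):
    """Closed-form (2 x 2) inverse of (16): adjugate over determinant."""
    det = F[0, 0] * F[1, 1] - F[0, 1] * F[1, 0]
    return np.array([[ F[1, 1], -F[0, 1]],
                     [-F[1, 0],  F[0, 0]]]) / det

Fk = np.array([[3.0, 1.0 - 2.0j],
               [1.0 + 2.0j, 4.0]])      # Hermitian example block
assert np.allclose(inv2x2(Fk), np.linalg.inv(Fk))
```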
Let $\Gamma = (D \otimes I)h_m = \bigl[\Gamma[0], \Gamma[1], \ldots, \Gamma[L_F-1]\bigr]$, where $\Gamma[k] = \bigl[e_1(k)\ e_2(k)\bigr]$ is the combination of the $k$th elements of the dimension-wise FFT coefficients; then a merged computation of the matrix inverse and multiplication is given by
\[
W = F^{-1}(D \otimes I)h_m
  = \operatorname{diag}\bigl(F[0]^{-1}, F[1]^{-1}, \ldots, F[L_F-1]^{-1}\bigr)\,\Gamma
  = \bigl[F[0]^{-1}\Gamma[0]^{T}, F[1]^{-1}\Gamma[1]^{T}, \ldots, F[L_F-1]^{-1}\Gamma[L_F-1]^{T}\bigr].
\tag{17}
\]
Thus, we can use a single merged loop to compute the
final result of W instead of using separate loops. Moreover,
with the Hermitian property of F (so that $f_{00}(k)$ and $f_{11}(k)$ are real and $f_{10}(k) = f_{01}^{*}(k)$), we can reduce the
number of real operations in the matrix inverse and multi-
plication module. This leads to a simplified equation for the
kth element of the matrix W as
\[
W(k) = \frac{1}{f_{00}(k) \cdot f_{11}(k) - \bigl|f_{01}(k)\bigr|^{2}}
\begin{pmatrix}
f_{11}(k) \circ e_1(k) - f_{01}(k) \ast e_2(k) \\
f_{00}(k) \circ e_2(k) - f_{01}^{*}(k) \ast e_1(k)
\end{pmatrix},
\tag{18}
\]
where “$a \cdot b$,” “$a \circ b$,” and “$a \ast b$” denote a “real × real,” “real × complex,” and “complex × complex” multiplication, respectively. The complex division is replaced by a real division. From this, we derive the simplified data path with the
Hermitian optimization as in Figure 4. In this figure, $f_{00}(k)$ and $f_{11}(k)$ are real numbers. The plain multiplier denotes a real multiplication, the multiplier with a circle denotes a “real × complex” multiplication, and the multiplier with a rectangle denotes a “complex × complex” multiplication. The simplified data path facilitates scaling and thus improves the stability of the fixed-point implementation.
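To make the merged data path concrete, the following behavioral sketch implements (17)-(18) as a single loop over k in floating point; the names merged_tap_solve, e1, and e2 are ours, whereas the data path of Figure 4 realizes the same computation in fixed point with a single real division and scaling.

```python
import numpy as np

def merged_tap_solve(F_blocks, Gamma):
    """Merged (2 x 2) Hermitian inverse and multiply, per (17)-(18).
    F_blocks: L_F Hermitian (2 x 2) blocks F[k].
    Gamma:    L_F vectors [e1(k), e2(k)] of dimension-wise FFT coefficients.
    Only one real division is needed per k; the remaining work consists of
    real x complex and complex x complex multiplications, as in Figure 4."""
    W = []
    for Fk, (e1, e2) in zip(F_blocks, Gamma):
        f00 = Fk[0, 0].real              # diagonal entries of a Hermitian block are real
        f11 = Fk[1, 1].real
        f01 = Fk[0, 1]                   # f10(k) = conj(f01(k))
        inv_det = 1.0 / (f00 * f11 - abs(f01) ** 2)    # the single real division
        w1 = inv_det * (f11 * e1 - f01 * e2)           # f11 o e1  -  f01 * e2
        w2 = inv_det * (f00 * e2 - np.conj(f01) * e1)  # f00 o e2  -  f01^* * e1
        W.append(np.array([w1, w2]))
    return W

# Cross-check one block against a direct inverse-and-multiply.
Fk = np.array([[3.0, 1.0 - 2.0j], [1.0 + 2.0j, 4.0]])
e = np.array([1.0 + 1.0j, 2.0 - 1.0j])
assert np.allclose(merged_tap_solve([Fk], [e])[0], np.linalg.inv(Fk) @ e)
```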
4.3.2. Receiver with 4 Rx antennas
The principal operation of interest is the inverse of the (4 × 4) submatrices. To achieve a scalable design, we first partition the (4 × 4) submatrices in F[i] into four (2 × 2) block submatrices as
\[
F(i)_{4\times 4} =
\begin{pmatrix}
f_{11}(i) & f_{12}(i) & f_{13}(i) & f_{14}(i) \\
f_{21}(i) & f_{22}(i) & f_{23}(i) & f_{24}(i) \\
f_{31}(i) & f_{32}(i) & f_{33}(i) & f_{34}(i) \\
f_{41}(i) & f_{42}(i) & f_{43}(i) & f_{44}(i)
\end{pmatrix}
=
\begin{pmatrix}
B_{11}(i) & B_{12}(i) \\
B_{21}(i) & B_{22}(i)
\end{pmatrix}.
\tag{19}
\]
The inverse of the (4 × 4) matrix can be carried out by a se-
quential inverse of four (2 × 2) submatrices. We also partition
the inverse of the (4 × 4) element matrix as
\[
F(i)^{-1} =
\begin{pmatrix}
C_{11}(i) & C_{12}(i) \\
C_{21}(i) & C_{22}(i)
\end{pmatrix}.
\tag{20}
\]
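The closed-form expressions for $C_{11}(i)$ through $C_{22}(i)$ follow from this partition. As an illustration of how the (2 × 2) kernel can be reused, the sketch below applies the standard block-inversion (Schur-complement) identities; this is our assumption about how the $C_{ij}$ blocks are obtained, and the paper's sequential procedure may organize the (2 × 2) inversions differently.

```python
import numpy as np

def inv2x2(F):
    """Adjugate-based (2 x 2) inverse of (16)."""
    det = F[0, 0] * F[1, 1] - F[0, 1] * F[1, 0]
    return np.array([[ F[1, 1], -F[0, 1]],
                     [-F[1, 0],  F[0, 0]]]) / det

def inv4x4_blockwise(F):
    """(4 x 4) inverse built from the (2 x 2) partition of (19)-(20),
    using the standard Schur-complement identities (our assumption of
    how the C_ij blocks are computed)."""
    B11, B12 = F[:2, :2], F[:2, 2:]
    B21, B22 = F[2:, :2], F[2:, 2:]
    B11_inv = inv2x2(B11)                          # reuse of the (2 x 2) kernel
    S = B22 - B21 @ B11_inv @ B12                  # Schur complement of B11
    C22 = inv2x2(S)                                # second (2 x 2) inversion
    C12 = -B11_inv @ B12 @ C22
    C21 = -C22 @ B21 @ B11_inv
    C11 = B11_inv + B11_inv @ B12 @ C22 @ B21 @ B11_inv
    return np.block([[C11, C12], [C21, C22]])

# Quick check on a random Hermitian, well-conditioned (4 x 4) block.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
F = A @ A.conj().T + 4 * np.eye(4)
assert np.allclose(inv4x4_blockwise(F), np.linalg.inv(F))
```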