TABLE I: Operands for αk computation
Thread id (i) | sk-1 (ub = 0) | pk (ub = 0) | sk-1 (ub = 1) | pk (ub = 1)
0             | 0             | 0           | 1             | 1
1             | 3             | 1           | 2             | 0
2             | 4             | 1           | 5             | 0
3             | 7             | 0           | 6             | 1
4             | 1             | 0           | 0             | 1
5             | 2             | 1           | 3             | 0
6             | 5             | 1           | 4             | 0
7             | 6             | 0           | 7             | 1
Algorithm 1 thread i computes αk(i)
  a0 ← αk-1(sk-1 | ub = 0) + Lc(y_k^p) · (pk | ub = 0)
  a1 ← αk-1(sk-1 | ub = 1) + (Lc(y_k^s) + La(k)) + Lc(y_k^p) · (pk | ub = 1)
  αk(i) = max*(a0, a1)
  SYNC
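As a concrete illustration, Algorithm 1 maps directly onto an 8-thread CUDA fragment. The following is a minimal sketch, assuming Table I is held in constant-memory lookup tables and α is laid out in shared memory with eight consecutive entries per stage; the names (max_star, alpha_step, S_PREV0, and so on) are illustrative and not taken from the actual implementation.

```cuda
// Sketch of one forward-recursion step (Algorithm 1). Assumes eight threads
// cooperating on one trellis (one thread per state), Table I stored as
// constant lookup tables, and alpha kept in shared memory with 8 consecutive
// floats per trellis stage.
__device__ float max_star(float a, float b)
{
    // Jacobian logarithm: max*(a, b) = max(a, b) + log(1 + exp(-|a - b|))
    return fmaxf(a, b) + __logf(1.0f + __expf(-fabsf(a - b)));
}

// Table I: predecessor state and parity bit for ub = 0 and ub = 1.
__constant__ int S_PREV0[8] = {0, 3, 4, 7, 1, 2, 5, 6};
__constant__ int P0[8]      = {0, 1, 1, 0, 0, 1, 1, 0};
__constant__ int S_PREV1[8] = {1, 2, 5, 6, 0, 3, 4, 7};
__constant__ int P1[8]      = {1, 0, 0, 1, 1, 0, 0, 1};

__device__ void alpha_step(float *alpha,   // shared memory: alpha[stage * 8 + state]
                           int k,          // current trellis stage
                           float Lc_ys,    // channel LLR of systematic bit k
                           float Lc_yp,    // channel LLR of parity bit k
                           float La_k)     // a priori LLR of bit k
{
    int i = threadIdx.x;                   // one thread per trellis state
    float a0 = alpha[(k - 1) * 8 + S_PREV0[i]] + Lc_yp * P0[i];
    float a1 = alpha[(k - 1) * 8 + S_PREV1[i]] + (Lc_ys + La_k) + Lc_yp * P1[i];
    alpha[k * 8 + i] = max_star(a0, a1);
    __syncthreads();                       // SYNC: all of stage k is now visible
}
```

Because S_PREV0 and S_PREV1 are each a permutation of the eight states, the eight threads read eight distinct shared-memory addresses within stage k-1 under this layout, consistent with the bank-conflict-free access described in the text.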
C. Backward Traversal and LLR Computation
After the forward traversal, each thread block traverses the trellis backward to compute β. We assign one thread to each trellis state to compute β, followed by the computation of Λ0 and Λ1, as shown in Algorithm 2. The indices of βk+1 and the values of pk are summarized in Table II. As in the forward traversal, there are no shared memory bank conflicts, since each thread accesses an element of α or β in a different bank.
TABLE II: Operands for βk computation
Thread id (i) | sk+1 (ub = 0) | pk (ub = 0) | sk+1 (ub = 1) | pk (ub = 1)
0             | 0             | 0           | 4             | 1
1             | 4             | 0           | 0             | 1
2             | 5             | 1           | 1             | 0
3             | 1             | 1           | 5             | 0
4             | 2             | 1           | 6             | 0
5             | 6             | 1           | 2             | 0
6             | 7             | 0           | 3             | 1
7             | 3             | 0           | 7             | 1
Algorithm 2 thread i computes βk(i), Λ0(i), and Λ1(i)
  b0 ← βk+1(sk+1 | ub = 0) + Lc(y_k^p) · (pk | ub = 0)
  b1 ← βk+1(sk+1 | ub = 1) + (Lc(y_k^s) + La(k)) + Lc(y_k^p) · (pk | ub = 1)
  βk(i) = max*(b0, b1)
  SYNC
  Λ0(i) = αk(i) + Lc(y_k^p) · (pk | ub = 0) + βk+1(sk+1 | ub = 0)
  Λ1(i) = αk(i) + (Lc(y_k^s) + La(k)) + Lc(y_k^p) · (pk | ub = 1) + βk+1(sk+1 | ub = 1)
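The backward step admits the same mapping. Below is a minimal sketch under the same assumptions as the forward sketch (Table II in constant lookup tables, eight floats per stage in shared memory, max_star() as defined earlier); the names are again illustrative. It also uses the fact that Λ0(i) = αk(i) + b0 and Λ1(i) = αk(i) + b1.

```cuda
// Sketch of one backward-recursion step (Algorithm 2). Table II is stored as
// constant lookup tables; alpha and beta live in shared memory with 8 floats
// per stage (indexing is shown per stage for clarity; an implementation may
// keep only a window of beta). Lambda0/Lambda1 buffer 8 stages of 8 states.
__constant__ int S_NEXT0[8] = {0, 4, 5, 1, 2, 6, 7, 3};
__constant__ int Q0[8]      = {0, 0, 1, 1, 1, 1, 0, 0};
__constant__ int S_NEXT1[8] = {4, 0, 1, 5, 6, 2, 3, 7};
__constant__ int Q1[8]      = {1, 1, 0, 0, 0, 0, 1, 1};

__device__ void beta_step(const float *alpha, float *beta,
                          float *Lambda0, float *Lambda1,  // 8 stages x 8 states
                          int k, float Lc_ys, float Lc_yp, float La_k)
{
    int i = threadIdx.x;                                   // one thread per state
    float b0 = beta[(k + 1) * 8 + S_NEXT0[i]] + Lc_yp * Q0[i];
    float b1 = beta[(k + 1) * 8 + S_NEXT1[i]] + (Lc_ys + La_k) + Lc_yp * Q1[i];
    beta[k * 8 + i] = max_star(b0, b1);                    // max_star() as above
    __syncthreads();                                       // SYNC

    // Per-state branch sums for stage k, buffered for the LLR step
    // (Algorithm 3): Lambda0 = alpha + b0, Lambda1 = alpha + b1.
    Lambda0[(k & 7) * 8 + i] = alpha[k * 8 + i] + b0;
    Lambda1[(k & 7) * 8 + i] = alpha[k * 8 + i] + b1;
}
```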
After computing Λ0 and Λ1 for stage k, we can compute the extrinsic LLR for stage k. However, there are 8 threads available to compute this single LLR, which introduces parallelism overhead. Instead of computing one extrinsic LLR for stage k as soon as the decoder computes βk, we let the threads traverse the trellis and save 8 stages of Λ0 and Λ1 before performing the extrinsic LLR computations. By saving eight stages of Λ0 and Λ1, we allow all 8 threads to compute LLRs efficiently in parallel: each thread handles one stage of Λ0 and Λ1 to compute one LLR. Although this increases thread utilization, the threads need to avoid accessing the same bank when computing the extrinsic LLRs. For example, the 8 elements of Λ0 for each stage are stored in 8 consecutive addresses. Since there are 16 memory banks, elements with the same index in the even stages of Λ0 or Λ1 share the same memory bank, and likewise for the odd stages. Hence, sequential accesses to Λ0 or Λ1 when computing the extrinsic LLRs would result in four-way memory bank conflicts. To alleviate this problem, we permute the access pattern based on thread ID, as shown in Algorithm 3.
Algorithm 3 thread i computes Le(i)
  λ0 = Λ0(i)
  λ1 = Λ1(i)
  for j = 1 to 7 do
    index = (i + j) & 7
    λ0 = max*(λ0, Λ0(index))
    λ1 = max*(λ1, Λ1(index))
  end for
  Le = λ1 - λ0
  Compute write address
  Write Le to device memory
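A minimal sketch of this reduction is given below, assuming Λ0 and Λ1 are stored as eight buffered stages of eight states each in shared memory, that thread i reduces the stage it owns, and that max_star() is the helper from the earlier sketches; extrinsic_llr, stage_base, and Le_dev are illustrative names.

```cuda
// Sketch of Algorithm 3. Lambda0/Lambda1 hold 8 buffered stages x 8 states in
// shared memory; thread i reduces its own stage. Starting at state i and
// rotating the state index via (i + j) & 7 staggers the accesses across
// threads so that each iteration touches 8 different banks.
__device__ void extrinsic_llr(const float *Lambda0, const float *Lambda1,
                              float *Le_dev,     // extrinsic LLRs in device memory
                              int stage_base)    // first trellis stage in the buffer
{
    int i = threadIdx.x;                         // thread i owns buffered stage i
    float l0 = Lambda0[i * 8 + i];               // start at state index i ...
    float l1 = Lambda1[i * 8 + i];
    for (int j = 1; j < 8; j++) {
        int idx = (i + j) & 7;                   // ... then rotate over the rest
        l0 = max_star(l0, Lambda0[i * 8 + idx]);
        l1 = max_star(l1, Lambda1[i * 8 + idx]);
    }
    // Extrinsic LLR for trellis stage (stage_base + i). In the second
    // half-iteration the write address would be interleaved (Section D).
    Le_dev[stage_base + i] = l1 - l0;
}
```

With this rotation, in iteration j the even-numbered threads read banks (i + j) mod 8 and the odd-numbered threads read banks 8 + ((i + j) mod 8), so all eight reads fall into distinct banks of the 16-bank shared memory.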
D. Interleaver
The interleaver is used in the second half-iteration of the MAP decoding algorithm. In our implementation, we used the quadratic permutation polynomial (QPP) interleaver [17] proposed in the 3GPP LTE standard. The QPP interleaver is contention-free: it guarantees bank-conflict-free memory access, with each sub-block accessing a different memory bank. However, the memory access pattern is still random. Since the inputs reside in device memory, memory accesses are not necessarily coalesced. We reduce latency by pre-fetching data into shared memory. The QPP interleaver is defined as:
Π(x) = f1·x + f2·x² (mod N). (7)
Direct computation of Π(x) using Equation (7) can cause overflow; for example, with the maximum LTE block size N = 6144, the intermediate term f2·x² exceeds the range of a 32-bit integer. The following equation is used to compute Π(x) instead:
Π(x) = (f1 + f2·x (mod N)) · x (mod N). (8)
Another alternative is to compute Π(x) recursively [6], which requires Π(x) to be computed before Π(x + 1). This is not efficient for our design, as we need to compute several interleaved addresses in parallel; for example, when storing the extrinsic LLR values during the second half-iteration, 8 threads need to compute 8 interleaved addresses in parallel. Equation (8) allows each address to be computed independently and efficiently.
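For reference, a minimal sketch of the per-thread address computation implied by Equation (8) is shown below; qpp_interleave and its parameters are illustrative names, with f1, f2, and N denoting the block-size-dependent parameters from the LTE standard.

```cuda
// Sketch of the overflow-safe QPP address computation of Equation (8).
// Each thread can call this independently, so 8 interleaved addresses can be
// computed in parallel without the recursive formulation of [6].
__device__ int qpp_interleave(int x, int f1, int f2, int N)
{
    // For LTE block sizes (N <= 6144, f1 < N, f2 < N, x < N), both f2 * x and
    // t * x stay below N * N < 2^26, so 32-bit arithmetic suffices here,
    // whereas the f2 * x^2 term of Equation (7) can exceed the 32-bit range.
    int t = (f1 + (f2 * x) % N) % N;   // f1 + f2 * x (mod N)
    return (t * x) % N;                // ... * x (mod N)
}
```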
Although our decoder is configured for the 3GPP LTE stan-
dard, one can replace the current interleaver function with