TABLE I: Operands for αk computation
Thread id (i) | sk-1 (ub = 0) | pk (ub = 0) | sk-1 (ub = 1) | pk (ub = 1)
0             | 0             | 0           | 1             | 1
1             | 3             | 1           | 2             | 0
2             | 4             | 1           | 5             | 0
3             | 7             | 0           | 6             | 1
4             | 1             | 0           | 0             | 1
5             | 2             | 1           | 3             | 0
6             | 5             | 1           | 4             | 0
7             | 6             | 0           | 7             | 1
Algorithm 1 thread i computes αk(i)
  a0 ← αk-1(sk-1 | ub = 0) + Lc(y_k^p) · (pk | ub = 0)
  a1 ← αk-1(sk-1 | ub = 1) + (Lc(y_k^s) + La(k)) + Lc(y_k^p) · (pk | ub = 1)
  αk(i) = max*(a0, a1)
  SYNC
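As a concrete illustration, Algorithm 1 maps directly onto an 8-thread CUDA fragment. The following is a minimal sketch, assuming Table I is held in constant-memory lookup tables and α is laid out in shared memory with eight consecutive entries per stage; the names (max_star, alpha_step, S_PREV0, and so on) are illustrative and not taken from the actual implementation.

```cuda
// Sketch of one forward-recursion step (Algorithm 1). Assumes eight threads
// cooperating on one trellis (one thread per state), Table I stored as
// constant lookup tables, and alpha kept in shared memory with 8 consecutive
// floats per trellis stage.
__device__ float max_star(float a, float b)
{
    // Jacobian logarithm: max*(a, b) = max(a, b) + log(1 + exp(-|a - b|))
    return fmaxf(a, b) + __logf(1.0f + __expf(-fabsf(a - b)));
}

// Table I: predecessor state and parity bit for ub = 0 and ub = 1.
__constant__ int S_PREV0[8] = {0, 3, 4, 7, 1, 2, 5, 6};
__constant__ int P0[8]      = {0, 1, 1, 0, 0, 1, 1, 0};
__constant__ int S_PREV1[8] = {1, 2, 5, 6, 0, 3, 4, 7};
__constant__ int P1[8]      = {1, 0, 0, 1, 1, 0, 0, 1};

__device__ void alpha_step(float *alpha,   // shared memory: alpha[stage * 8 + state]
                           int k,          // current trellis stage
                           float Lc_ys,    // channel LLR of systematic bit k
                           float Lc_yp,    // channel LLR of parity bit k
                           float La_k)     // a priori LLR of bit k
{
    int i = threadIdx.x;                   // one thread per trellis state
    float a0 = alpha[(k - 1) * 8 + S_PREV0[i]] + Lc_yp * P0[i];
    float a1 = alpha[(k - 1) * 8 + S_PREV1[i]] + (Lc_ys + La_k) + Lc_yp * P1[i];
    alpha[k * 8 + i] = max_star(a0, a1);
    __syncthreads();                       // SYNC: all of stage k is now visible
}
```

Because S_PREV0 and S_PREV1 are each a permutation of the eight states, the eight threads read eight distinct shared-memory addresses within stage k-1 under this layout, consistent with the bank-conflict-free access described in the text.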
C. Backward Traversal and LLR Computation
After the forward traversal, each thread block traverses the trellis backward to compute β. We assign one thread to each trellis state to compute β, followed by the computation of Λ0 and Λ1, as shown in Algorithm 2. The indices of βk+1 and the values of pk are summarized in Table II. As in the forward traversal, there are no shared memory bank conflicts, since each thread accesses an element of α or β in a different bank.
TABLE II: Operands for βk computation
Thread id (i) | sk+1 (ub = 0) | pk (ub = 0) | sk+1 (ub = 1) | pk (ub = 1)
0             | 0             | 0           | 4             | 1
1             | 4             | 0           | 0             | 1
2             | 5             | 1           | 1             | 0
3             | 1             | 1           | 5             | 0
4             | 2             | 1           | 6             | 0
5             | 6             | 1           | 2             | 0
6             | 7             | 0           | 3             | 1
7             | 3             | 0           | 7             | 1
Algorithm 2 thread i computes βk(i), Λ0(i), and Λ1(i)
  b0 ← βk+1(sk+1 | ub = 0) + Lc(y_k^p) · (pk | ub = 0)
  b1 ← βk+1(sk+1 | ub = 1) + (Lc(y_k^s) + La(k)) + Lc(y_k^p) · (pk | ub = 1)
  βk(i) = max*(b0, b1)
  SYNC
  Λ0(i) = αk(i) + Lc(y_k^p) · (pk | ub = 0) + βk+1(sk+1 | ub = 0)
  Λ1(i) = αk(i) + (Lc(y_k^s) + La(k)) + Lc(y_k^p) · (pk | ub = 1) + βk+1(sk+1 | ub = 1)
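The backward step admits the same mapping. Below is a minimal sketch under the same assumptions as the forward sketch (Table II in constant lookup tables, eight floats per stage in shared memory, max_star() as defined earlier); the names are again illustrative. It also uses the fact that Λ0(i) = αk(i) + b0 and Λ1(i) = αk(i) + b1.

```cuda
// Sketch of one backward-recursion step (Algorithm 2). Table II is stored as
// constant lookup tables; alpha and beta live in shared memory with 8 floats
// per stage (indexing is shown per stage for clarity; an implementation may
// keep only a window of beta). Lambda0/Lambda1 buffer 8 stages of 8 states.
__constant__ int S_NEXT0[8] = {0, 4, 5, 1, 2, 6, 7, 3};
__constant__ int Q0[8]      = {0, 0, 1, 1, 1, 1, 0, 0};
__constant__ int S_NEXT1[8] = {4, 0, 1, 5, 6, 2, 3, 7};
__constant__ int Q1[8]      = {1, 1, 0, 0, 0, 0, 1, 1};

__device__ void beta_step(const float *alpha, float *beta,
                          float *Lambda0, float *Lambda1,  // 8 stages x 8 states
                          int k, float Lc_ys, float Lc_yp, float La_k)
{
    int i = threadIdx.x;                                   // one thread per state
    float b0 = beta[(k + 1) * 8 + S_NEXT0[i]] + Lc_yp * Q0[i];
    float b1 = beta[(k + 1) * 8 + S_NEXT1[i]] + (Lc_ys + La_k) + Lc_yp * Q1[i];
    beta[k * 8 + i] = max_star(b0, b1);                    // max_star() as above
    __syncthreads();                                       // SYNC

    // Per-state branch sums for stage k, buffered for the LLR step
    // (Algorithm 3): Lambda0 = alpha + b0, Lambda1 = alpha + b1.
    Lambda0[(k & 7) * 8 + i] = alpha[k * 8 + i] + b0;
    Lambda1[(k & 7) * 8 + i] = alpha[k * 8 + i] + b1;
}
```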
After computing Λ0 and Λ1 for stage k, we can compute the extrinsic LLR for stage k. However, there are 8 threads available to compute this single LLR, which introduces parallelism overhead. Instead of computing one extrinsic LLR for stage k as soon as the decoder computes βk, we let the threads traverse the trellis and save 8 stages of Λ0 and Λ1 before performing the extrinsic LLR computations. By saving eight stages of Λ0 and Λ1, we allow all 8 threads to compute LLRs efficiently in parallel: each thread handles one stage of Λ0 and Λ1 to compute one LLR. Although this increases thread utilization, the threads need to avoid accessing the same bank when computing the extrinsic LLRs. For example, the 8 elements of Λ0 for each stage are stored in 8 consecutive addresses. Since there are 16 memory banks, elements with the same index in the even stages of Λ0 or Λ1 share the same memory bank, and likewise for the odd stages. Hence, sequential accesses to Λ0 or Λ1 when computing the extrinsic LLRs would result in four-way memory bank conflicts. To alleviate this problem, we permute the access pattern based on thread ID, as shown in Algorithm 3.
Algorithm 3 thread i computes Le(i)
  λ0 = Λ0(i)
  λ1 = Λ1(i)
  for j = 1 to 7 do
    index = (i + j) & 7
    λ0 = max*(λ0, Λ0(index))
    λ1 = max*(λ1, Λ1(index))
  end for
  Le = λ1 - λ0
  Compute write address
  Write Le to device memory
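A minimal sketch of this reduction is given below, assuming Λ0 and Λ1 are stored as eight buffered stages of eight states each in shared memory, that thread i reduces the stage it owns, and that max_star() is the helper from the earlier sketches; extrinsic_llr, stage_base, and Le_dev are illustrative names.

```cuda
// Sketch of Algorithm 3. Lambda0/Lambda1 hold 8 buffered stages x 8 states in
// shared memory; thread i reduces its own stage. Starting at state i and
// rotating the state index via (i + j) & 7 staggers the accesses across
// threads so that each iteration touches 8 different banks.
__device__ void extrinsic_llr(const float *Lambda0, const float *Lambda1,
                              float *Le_dev,     // extrinsic LLRs in device memory
                              int stage_base)    // first trellis stage in the buffer
{
    int i = threadIdx.x;                         // thread i owns buffered stage i
    float l0 = Lambda0[i * 8 + i];               // start at state index i ...
    float l1 = Lambda1[i * 8 + i];
    for (int j = 1; j < 8; j++) {
        int idx = (i + j) & 7;                   // ... then rotate over the rest
        l0 = max_star(l0, Lambda0[i * 8 + idx]);
        l1 = max_star(l1, Lambda1[i * 8 + idx]);
    }
    // Extrinsic LLR for trellis stage (stage_base + i). In the second
    // half-iteration the write address would be interleaved (Section D).
    Le_dev[stage_base + i] = l1 - l0;
}
```

With this rotation, in iteration j the even-numbered threads read banks (i + j) mod 8 and the odd-numbered threads read banks 8 + ((i + j) mod 8), so all eight reads fall into distinct banks of the 16-bank shared memory.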
D. Interleaver
The interleaver is used in the second half-iteration of the MAP decoding algorithm. In our implementation, we used the quadratic permutation polynomial (QPP) interleaver [17] proposed in the 3GPP LTE standard. The QPP interleaver is contention-free: it guarantees bank-conflict-free memory access, with each sub-block accessing a different memory bank. However, the memory access pattern is still random. Since the inputs reside in device memory, memory accesses are not necessarily coalesced. We reduce latency by pre-fetching data into shared memory. The QPP interleaver is defined as:
Π(x) = f1·x + f2·x² (mod N). (7)
Direct computation of Π(x) using Equation (7) can cause overflow; for example, with the maximum LTE block size N = 6144, the intermediate term f2·x² exceeds the range of a 32-bit integer. The following equation is used to compute Π(x) instead:
Π(x) = (f1 + f2·x (mod N)) · x (mod N). (8)
Another alternative is to compute Π(x) recursively [6], which requires Π(x) to be computed before Π(x + 1). This is not efficient for our design, as we need to compute several interleaved addresses in parallel; for example, when storing the extrinsic LLR values during the second half-iteration, 8 threads need to compute 8 interleaved addresses in parallel. Equation (8) allows each address to be computed independently and efficiently.
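For reference, a minimal sketch of the per-thread address computation implied by Equation (8) is shown below; qpp_interleave and its parameters are illustrative names, with f1, f2, and N denoting the block-size-dependent parameters from the LTE standard.

```cuda
// Sketch of the overflow-safe QPP address computation of Equation (8).
// Each thread can call this independently, so 8 interleaved addresses can be
// computed in parallel without the recursive formulation of [6].
__device__ int qpp_interleave(int x, int f1, int f2, int N)
{
    // For LTE block sizes (N <= 6144, f1 < N, f2 < N, x < N), both f2 * x and
    // t * x stay below N * N < 2^26, so 32-bit arithmetic suffices here,
    // whereas the f2 * x^2 term of Equation (7) can exceed the 32-bit range.
    int t = (f1 + (f2 * x) % N) % N;   // f1 + f2 * x (mod N)
    return (t * x) % N;                // ... * x (mod N)
}
```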
Although our decoder is configured for the 3GPP LTE stan-
dard, one can replace the current interleaver function with