Implementation of a 3GPP LTE Turbo Decoder Accelerator on GPU



TABLE I: Operands for αk computation

Ub =

0___

Ub =

1____

Thread id (i)

sk-1

Pk

sk-1

Pk

0

0

0

1

1

1

3

1

2

0

2

4

_iΓ^'

5

~0~

3

7

~o~

6

~1Γ^'

4

1

~o~

0

^^1Γ^'

5

2

~iΓ^'

3

~o~

6

5

^^Γ-

4

^^0^^

7

6

0

7

1

Algorithm 1 thread i computes αk (i)

a о ^ αk-1( sk-1 = 0) + Lc ( yss. ) * (pk\щ = 0)

a 1 ^ αk-1(sk-1 \ub = 1) + (Lc (yk) + La (k))

+Lc(psk)(pk\ub =1)

αk ( i ) = max * ( a 0 ,a 1)

SYNC

C. Backward Traversal and LLR Computation

After the forward traversal, each thread block traverses through
the trellis backward to compute
β . We assign one thread to each
trellis level to compute
β , followed by computing Λ0 and Λ1
shown in Algorithm 2. The indices of βk+1 and values of pk are
summarized in Table II. Similar to the forward traversal, there
are no shared memory bank conflicts since each thread accesses
an element of
α or β in a different bank.

TABLE II: Operands for βk computation

Ub =

0___

Ub =

1____

Thread id (i)

sk+1

Pk

sk+1

Pk

0

0

0

4

1

1

4

0

0

1

2

5

1

1

0

3

1

1

5

0

4

2

1

6

0

5

6

1

2

0

6

7

0

3

1

7

3

0

7

1

Algorithm 2 thread i computes βk (i) and Λ0(i) and Λ1 (i)

bо ^ αk(Sk\ub = 0) + Lc(yss.) * (Pkь = 0)

b 1 ^ αk +1(sk+1 ub = 1) + (Lc(ys) + La (k))

+Lc(psk)(pk\ub =1)

βk ( i ) = max * ( b о ,b 1)

SYNC

Λo( i ) = αk ( i ) + Lp ( i ) pk + βk+1( i )

Λ1 (i)=αk (i) + (Lc(k) + La (k)) + Lp (sk)pk + βk (i)

After computing Λo and Λ1 for stage k, we can compute the
extrinsic LLR for stage
k. However, there are 8 threads avail-
able to compute the single LLR, which introduces parallelism
overhead. Instead of computing one extrinsic LLR for stage
k
as soon as the decoder computes βk, we allow the threads to
traverse through the trellis and save 8 stages of Λ
o and Λ1
before performing extrinsic LLR computations. By saving eight
stages of Λ
o and Λ1, we allow all 8 threads to compute LLRs in
parallel efficiently. Each thread handles one stage of Λ
o and Λ1
to compute an LLR. Although this increases thread utilization,
threads need to avoid accessing the same bank when computing
extrinsic LLR. For example, 8 elements of Λ
o for each stage are
stored in 8 consecutive addresses. Since there are 16 memory
banks, elements of even stages Λ
o or Λ1 with the same index
would share the same memory bank. Likewise, this is true for
even stages of Λ
o . Hence, sequential accesses to Λo or Λ1 to
compute extrinsic LLR will result in four-way memory bank
conflicts. To alleviate this problem, we permute the access pattern
based on thread ID as shown in Algorithm 3.

Algorithm 3 thread i computes Le (i)

λ 0 = Λo( i )

λ1 1(i)

for j =1 to 7 do

index =(i + j)&7

λo = max* (λo, Λo(index))

λ1 = max* (λ1, Λ1(index))

Le = λ1 - λo

Compute write address

Write Le to device memory

end for

D. Interleaver

The interleaver is used in the second half iteration of the MAP
decoding algorithm. In our implementation, a quadratic permu-
tation polynomial (QPP) interleaver [17], which is proposed in
the 3GPP LTE standard was used. Although the QPP interleaver
is contention free since it can guarantee bank free memory
access, where each sub-block accesses a different memory bank.
However, the memory access pattern is still random. Since the
inputs are shared in device memory, memory accesses are not
necessarily coalesced. We reduce latency by pre-fetching data
into the shared memory. The QPP interleaver is defined as:

Π(x) = f1x + f2x2 (mod N).              (7)

Direct computation of Π(x) using Equation (7) can cause over-
flow. For example, 6143
2 can not be represented as a 32-bit
integer. The following equation is used to compute Π(
x) instead:

Π(x) = (f 1 + f2x (mod N)) ∙ x (mod N)       (8)

Another alternative is to compute Π(x) recursively [6], which
requires Π(
x) to be computed before we can compute Π(x +1).
This is not efficient for our design as we need to compute several
interleaved addresses in parallel. For example, during the second
half of the iteration to store extrinsic LLR values, 8 threads need
to compute 8 interleaved address in parallel. Equation (8) allows
efficient address computation in parallel.

Although our decoder is configured for the 3GPP LTE stan-
dard, one can replace the current interleaver function with

195




More intriguing information

1. Protocol for Past BP: a randomised controlled trial of different blood pressure targets for people with a history of stroke of transient ischaemic attack (TIA) in primary care
2. Assessing Economic Complexity with Input-Output Based Measures
3. Before and After the Hartz Reforms: The Performance of Active Labour Market Policy in Germany
4. Mergers under endogenous minimum quality standard: a note
5. 5th and 8th grade pupils’ and teachers’ perceptions of the relationships between teaching methods, classroom ethos, and positive affective attitudes towards learning mathematics in Japan
6. The Provisions on Geographical Indications in the TRIPS Agreement
7. Spousal Labor Market Effects from Government Health Insurance: Evidence from a Veterans Affairs Expansion
8. The name is absent
9. The name is absent
10. El impacto espacial de las economías de aglomeración y su efecto sobre la estructura urbana.El caso de la industria en Barcelona, 1986-1996
11. The name is absent
12. IMMIGRATION POLICY AND THE AGRICULTURAL LABOR MARKET: THE EFFECT ON JOB DURATION
13. Examining Variations of Prominent Features in Genre Classification
14. The name is absent
15. Group cooperation, inclusion and disaffected pupils: some responses to informal learning in the music classroom
16. The name is absent
17. The name is absent
18. The name is absent
19. Experience, Innovation and Productivity - Empirical Evidence from Italy's Slowdown
20. The name is absent