Implementation of a 3GPP LTE Turbo Decoder Accelerator on GPU



switching over to another independent warp instruction on a
stall.

Computation throughput can still become I/O limited if memory
bandwidth is low. Fortunately, fast on-chip resources, such as
registers, shared memory, and constant memory, can be used in
place of off-chip device memory to keep computation throughput
high. Shared memory is especially useful: it can reduce memory
access time by keeping data on-chip, and it can reduce redundant
calculation by allowing data sharing among independent threads.
However, shared memory on each SM has 16 access ports. An
access takes one cycle if all 16 consecutive threads access the
same port (broadcast) or if no two threads access the same port
(one-to-one). A mixed layout, with some broadcast and some
one-to-one accesses, will be serialized and cause a stall.
There are several other limitations of shared memory. First,
only threads within a block can share data among themselves;
threads in different blocks cannot share data through shared
memory. Second, there is only 16KB of shared memory on each
streaming multiprocessor, and it is divided among the concurrent
thread blocks on an SM. Using too much shared memory can
therefore reduce the number of concurrent thread blocks mapped
onto an SM.
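Since shared memory caps the number of resident blocks, the limit reduces to integer arithmetic. A minimal sketch (the 16KB figure is from the text above; the per-block usage values in the example are hypothetical, and other limits such as registers are ignored):

```c
/* Number of thread blocks that can be resident on one SM at a time,
 * as limited by shared memory alone. */
static int blocks_per_sm(int smem_per_sm, int smem_per_block)
{
    return smem_per_sm / smem_per_block;
}
```

For example, blocks that each use 4KB of shared memory allow four concurrent blocks per SM, while 10KB per block leaves room for only one.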

As a result, it is a challenging task to implement an algorithm
that keeps the GPU cores from idling: we need to partition the
workload across cores while using shared memory effectively and
keeping a sufficient number of thread blocks executing
concurrently.

III. MAP decoding algorithm

The principle of Turbo decoding is based on the BCJR or MAP
(maximum a posteriori) algorithm [15]. The structure of a MAP
decoder is shown in Figure 1. One iteration of the decoding
process consists of one pass through both decoders. Although
both decoders perform the same set of computations, the two
decoders have different inputs. The inputs of the first decoder
are the deinterleaved extrinsic log-likelihood ratios (LLRs) from
the second decoder and the input LLRs from the channel. The
inputs of the second decoder are the interleaved extrinsic LLRs
from the first decoder and the input LLRs from the channel.
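The data exchange in Figure 1 amounts to permuting each decoder's extrinsic output before feeding it to the other decoder. A minimal sketch of those permutations, assuming the interleaver is given as an index table `pi` (the function names are illustrative, not from the paper):

```c
#include <stddef.h>

/* Interleave: out[k] = in[pi[k]], the order in which the second
 * decoder consumes the first decoder's extrinsic LLRs. */
static void interleave(const size_t *pi, const float *in, float *out,
                       size_t n)
{
    for (size_t k = 0; k < n; k++)
        out[k] = in[pi[k]];
}

/* De-interleave: the inverse permutation, restoring natural order
 * for the first decoder's input. */
static void deinterleave(const size_t *pi, const float *in, float *out,
                         size_t n)
{
    for (size_t k = 0; k < n; k++)
        out[pi[k]] = in[k];
}
```

One decoding iteration then runs the MAP decoder twice, de-interleaving the second decoder's extrinsic output before the first pass and interleaving the first decoder's before the second.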

Fig. 1: Overview of turbo decoding

To decode a codeword with N information bits, each decoder
performs a forward traversal followed by a backward traversal
through an N-stage trellis to compute an extrinsic LLR for each
bit. The trellis structure, or the connections between two stages
of the trellis, is defined by the encoder. Figure 2 shows the trellis
structure for the 3GPP LTE turbo code, where each state has two
incoming paths, one path for $u_k = 0$ and one path for $u_k = 1$.
Let $s_k$ be a state at stage $k$; the branch metric (or transition
probability) is defined as:

$$\gamma_k(s_{k-1}, s_k) = \left( L_c(y_k^s) + L_a(y_k^s) \right) u_k + L_c(y_k^p)\, p_k, \qquad (1)$$
where $u_k$, the information bit, and $p_k$, the parity bit, are
dependent on the path taken $(s_{k-1}, s_k)$. $L_c(y_k^s)$ is the
systematic channel LLR, $L_a(y_k^s)$ is the a priori LLR, and
$L_c(y_k^p)$ is the parity bit channel LLR at stage $k$. The
decoder first performs

Fig. 2: 3GPP LTE turbo code trellis with 8 states

a forward traversal to compute $\alpha_k$, the forward state metrics
for the trellis states in stage $k$. The state metrics $\alpha_k$
are computed recursively, as the computation depends on
$\alpha_{k-1}$. The forward state metric for a state $s_k$ at stage
$k$, $\alpha_k(s_k)$, is defined as:

$$\alpha_k(s_k) = \max^*_{s_{k-1} \in K} \left( \alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) \right), \qquad (2)$$

where $K$ is the set of paths that connect a state in stage $k-1$
to state $s_k$ in stage $k$.

After the decoder performs a forward traversal, the decoder
performs a backward traversal to compute $\beta_k$, the backward
state metrics for the trellis states in stage $k$. The backward
state metric for state $s_k$ at stage $k$, $\beta_k(s_k)$, is
defined as:

$$\beta_k(s_k) = \max^*_{s_{k+1} \in K} \left( \beta_{k+1}(s_{k+1}) + \gamma_{k+1}(s_k, s_{k+1}) \right). \qquad (3)$$

Although the computation is the same as the computation for
$\alpha_k$, the state transitions are different. In this case, $K$
is the set of paths that connect a state in stage $k+1$ to state
$s_k$ in stage $k$.

After computing $\beta_k$, the state metrics for all states in
stage $k$, we compute two LLRs per trellis state. We compute one
state LLR per state $s_k$, $\Lambda(s_k \mid u_k = 0)$, for the
incoming path connected to state $s_k$ which corresponds to
$u_k = 0$. In addition, we also compute one state LLR per state
$s_k$, $\Lambda(s_k \mid u_k = 1)$, for the incoming path connected
to state $s_k$ which corresponds to $u_k = 1$. The state LLR
$\Lambda(s_k \mid u_k = 0)$ is defined as:

$$\Lambda(s_k \mid u_k = 0) = \alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k), \qquad (4)$$

where the path from $s_{k-1}$ to $s_k$ with $u_k = 0$ is used in
the computation. Similarly, the state LLR
$\Lambda(s_k \mid u_k = 1)$ is defined as:

$$\Lambda(s_k \mid u_k = 1) = \alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k), \qquad (5)$$

where the path from $s_{k-1}$ to $s_k$ with $u_k = 1$ is used in
the computation.




