Implementation of a 3GPP LTE Turbo Decoder Accelerator on GPU



another function to support other standards. Furthermore, we
can define multiple interleavers and switch between them on-
the-fly since the interleaver is defined in software in our GPU
implementation.
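For illustration, the LTE quadratic permutation polynomial (QPP) interleaver can be expressed as a small device function, which is what makes swapping interleavers a matter of calling a different address-computation routine. The sketch below is an assumption about how such a function could look; the name qpp_interleave and the way the parameters f1, f2, and N are passed are placeholders, not our exact code.

```
/* Illustrative device-side address computation for the LTE QPP
   interleaver, pi(i) = (f1*i + f2*i*i) mod N. Function name and
   parameter passing are assumptions, not the paper's exact code. */
__device__ __forceinline__ int qpp_interleave(int i, int f1, int f2, int N)
{
    /* 64-bit intermediates so f2*i*i does not overflow for N up to 6144 */
    long long idx = ((long long)f1 * i + (long long)f2 * i * i) % N;
    return (int)idx;
}
```

Because no hardware interleaver table is involved, supporting another standard amounts to substituting a different address-computation function at kernel build or launch time.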

E. max* Function

Both natural logarithm and natural exponential are supported
on CUDA. We support full-log-MAP as well as max-log-MAP
[18]. We compute full-log-MAP by:

max*(a, b) = max(a, b) + ln(1 + e^{-|b-a|})          (9)

and max-log-MAP is defined as:

max*(a, b) = max(a, b).                  (10)

The throughput of full-log-MAP is lower than that of max-log-MAP. Not only does full-log-MAP require more instructions than max-log-MAP, but the natural logarithm and natural exponential instructions also take longer to execute on the GPU than common floating-point operations such as multiply and add. An alternative is to use a lookup table in constant memory. However, this is even less efficient, as multiple threads access different entries in the lookup table simultaneously and only the first entry will be a cached read.
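As a minimal sketch of the two max* variants in equations (9) and (10), the device functions below use the standard CUDA math functions logf, expf, and fmaxf; the function names maxstar_full and maxstar_max are illustrative, not taken from our implementation.

```
/* Sketch of the two max* variants (names are placeholders). */
__device__ __forceinline__ float maxstar_full(float a, float b)
{
    /* full-log-MAP: max(a, b) + ln(1 + e^{-|b-a|})  (Eq. 9) */
    return fmaxf(a, b) + logf(1.0f + expf(-fabsf(b - a)));
}

__device__ __forceinline__ float maxstar_max(float a, float b)
{
    /* max-log-MAP: drop the correction term  (Eq. 10) */
    return fmaxf(a, b);
}
```

The correction term is the only difference between the two modes, so the rest of the forward/backward recursion code can be shared between them.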

V. BER Performance and Throughput Results

We evaluated the accuracy of our decoder by comparing it against a reference standard C Viterbi implementation. To evaluate the BER performance and throughput of our turbo decoder, we tested it on a Linux platform with 8 GB of DDR2 memory running at 800 MHz and an Intel Core 2 Quad Q6600 running at 2.4 GHz. The GPU used in our experiments is the Nvidia TESLA C1060 graphics card, which has 240 stream processors running at 1.3 GHz with 4 GB of GDDR3 memory running at 1600 MHz.

A. Decoder BER Performance

Since our decoder can change P, the number of sub-blocks decoded in parallel, we first look at how the number of parallel sub-blocks affects the overall decoder BER performance. In our setup, the host computer first generates random bits and encodes them using a 3GPP LTE turbo encoder. After passing the input symbols through a channel with AWGN, the host generates LLR values which are fed into the decoding kernel running on the GPU. For this experiment, we tested our decoder with P = 32, 64, 96, 128 for a 3GPP LTE turbo code with N = 6144. In addition, we tested both full-log-MAP and max-log-MAP with the decoder performing 6 decoding iterations.
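A minimal host-side sketch of this LLR generation is shown below, assuming BPSK mapping over a real AWGN channel with noise variance sigma2, in which case the channel LLR is 2y/sigma^2; the function name and structure are assumptions, not our actual test harness.

```
/* Illustrative host-side LLR generation (BPSK over AWGN);
   names and structure are assumptions, not the paper's test code. */
#include <math.h>
#include <stdlib.h>

void generate_llrs(const int *coded_bits, float *llr, int len, float sigma2)
{
    for (int i = 0; i < len; i++) {
        float x  = coded_bits[i] ? -1.0f : 1.0f;   /* BPSK: bit 0 -> +1, bit 1 -> -1 */
        /* Box-Muller Gaussian noise sample */
        float u1 = (rand() + 1.0f) / (RAND_MAX + 2.0f);
        float u2 = (rand() + 1.0f) / (RAND_MAX + 2.0f);
        float n  = sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);
        float y  = x + sqrtf(sigma2) * n;          /* received symbol */
        llr[i]   = 2.0f * y / sigma2;              /* channel LLR fed to the GPU kernel */
    }
}
```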

Figure 4 shows the bit error rate (BER) performance of our decoder using full-log-MAP, while Figure 5 shows the BER performance of our decoder using max-log-MAP. In both cases, the BER performance of the decoder degrades as we increase P. The BER performance of the decoder is significantly better when full-log-MAP is used. Furthermore, we see that even with a parallelism of 96, where each sub-block is only 64 stages long, the decoder provides BER performance within 0.1 dB of the performance of the optimal case (P = 1).

Fig. 4: BER performance (BPSK, full-log-MAP)

Fig. 5: BER performance (BPSK, max-log-MAP)

B. Decoder Throughput

We measure the time it takes for the decoder to decode a batch of 100 codewords using event management in the CUDA runtime API. The measured runtimes include both memory transfers and kernel execution. Since our decoder supports various code sizes, we can decode N = 64, 1024, 2048, 6144 with various numbers of decoding iterations and degrees of parallelism P. The throughput of the decoder depends only on W = N/P, as decoding time is linearly dependent on the number of trellis stages traversed. Therefore, we report the decoder throughput as a function of W, which can be used to find the throughput of different decoder configurations. For example, if N = 6144, P = 64, and the decoder performs 1 iteration, the throughput of the decoder is the throughput when W = 96. The throughput of the decoder is summarized in Table III. We see that the throughput of the decoder is inversely proportional to the number of iterations performed. The throughput of the decoder after m iterations can be approximated as T0/m, where T0 is the throughput of the decoder after 1 iteration.
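The event-based timing described above can be set up roughly as follows using the CUDA runtime event API. This is a generic sketch rather than our exact measurement code; decode_kernel, its launch configuration, and the buffer names are placeholders.

```
/* Rough sketch of event-based timing around the transfers and the
   kernel launch; decode_kernel and its arguments are placeholders. */
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(d_llr, h_llr, llr_bytes, cudaMemcpyHostToDevice);
decode_kernel<<<grid, block>>>(d_llr, d_bits /* ... */);
cudaMemcpy(h_bits, d_bits, out_bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          /* elapsed time in ms */
/* throughput in Mbps for a batch of 100 codewords of N information bits */
float mbps = (100.0f * N) / (ms * 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the measurement brackets both memcpy calls and the kernel launch, the reported throughput reflects end-to-end decoding time, not kernel time alone.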

Although the throughput of full-log-MAP is lower than that of max-log-MAP, as expected, the difference is small, while full-log-MAP provides significantly better BER performance.




