
Implementation of a 3GPP LTE Turbo Decoder Accelerator on GPU

Michael Wu, Yang Sun, and Joseph R. Cavallaro

Electrical and Computer Engineering

Rice University, Houston, Texas 77005

{mbw2, ysun, cavallar}@rice.edu

Abstract—This paper presents a 3GPP LTE compliant turbo
decoder accelerator on GPU. The challenge of implementing a
turbo decoder is finding an efficient mapping of the decoder
algorithm onto the GPU, e.g., finding a good way to parallelize
the workload across cores and to allocate and use fast on-die
memory to improve throughput. In our implementation, we increase
throughput by 1) distributing the decoding workload for a codeword
across multiple cores, 2) decoding multiple codewords simultaneously
to increase concurrency, and 3) employing memory optimization
techniques to reduce memory bandwidth requirements. In addition,
we analyze how different MAP algorithm approximations affect
both throughput and bit error rate (BER) performance of this
decoder.

I. Introduction

Turbo codes [1] have become one of the most important
research topics in coding theory for wireless communication
systems. As a practical code that approaches channel capacity,
turbo codes are widely used in many 3G and 4G wireless
standards such as CDMA2000, WCDMA/UMTS, IEEE 802.16e
WiMax, and 3GPP LTE (Long Term Evolution). However, this low
BER performance comes at a price: the inherently large
decoding latency and the complex iterative decoding algorithm
make it very difficult to achieve high throughput on general-purpose
processors or digital signal processors. As a result, turbo
decoders are often implemented in ASICs or FPGAs [2-8].

The Graphics Processing Unit (GPU) is another alternative,
as it provides high computational power while maintaining
flexibility. GPUs deliver extremely high computation throughput
by employing many cores running in parallel. Similar to general-purpose
processors, GPUs are flexible enough to handle general-purpose
computations. In fact, a number of processing-intensive
communication algorithms have been implemented on GPU.
GPU implementations of LDPC decoders are capable of real-time
throughput [9]. In addition, both a hard-decision MIMO
detector [10] and a soft-decision MIMO detector [11]
have been implemented on GPU.

In this paper, we aim to provide an alternative: a turbo
decoder defined entirely in software on GPU that reduces the
design cycle and delivers good throughput. In particular, we
partition the decoding workload across cores and pre-fetch data
to reduce memory stalls. However, parallelizing the decoding
algorithm can improve the throughput of a decoder at the expense
of its BER performance. In this paper, we therefore report
both the throughput and the BER performance of the decoder and show
that we can parallelize the workload on GPU while maintaining
reasonable BER performance. Although ASIC and FPGA designs
are more power efficient and can offer higher throughput than our
GPU design [12], this work will allow us to accelerate simulation
as well as to implement a complete iterative MIMO receiver in
software on a wireless test-bed platform such as WARPLAB [13].

The rest of the paper is organized as follows: In Sections II
and III, we give an overview of the CUDA architecture
and the turbo decoding algorithm. In Section IV, we discuss the
implementation aspects on GPU. Finally, we present BER
performance and throughput results with analysis in Section V
and conclude in Section VI.

II. Compute Unified Device Architecture (CUDA)

Compute Unified Device Architecture [14] is a software programming
model that allows the programmer to harness the
massive computation potential offered by the programmable
GPU. The programming model is explicitly parallel: the programmer
specifies, in a kernel, how operations are applied to a set
of data. At runtime, multiple threads are spawned, where each
thread runs the operations defined by the kernel on a data set.
In this programming model, threads are independent by default;
however, threads within a block can share computation through
barrier synchronization and shared memory. Thread blocks, in
contrast, are completely independent and can be synchronized
only by writing to global memory and terminating the kernel.
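
To make this model concrete, the following minimal CUDA sketch (our illustration, not the paper's decoder code; the kernel name blockSum and the block size of 256 are assumptions) shows threads within a block cooperating through shared memory and barrier synchronization, while thread blocks remain independent and communicate only through global memory:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread block sums its 256-element segment
// of the input. Threads in a block cooperate through fast on-die shared
// memory and barriers; blocks stay independent and write their results
// to global memory.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float buf[256];              // shared by the whole block
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    buf[tid] = (idx < n) ? in[idx] : 0.0f;  // each thread loads one element
    __syncthreads();                        // barrier: all loads complete

    // Tree reduction within the block; every step needs a barrier.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)                           // one write per block
        blockResults[blockIdx.x] = buf[0];
}
```

A host would launch this as, e.g., blockSum<<<numBlocks, 256>>>(d_in, d_results, n); the only way two blocks can exchange their partial sums is through d_results in global memory after the kernel terminates, exactly as described above.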

Compared to traditional general-purpose processors, a programmable
GPU has much higher peak computation throughput.
This computation power is enabled by the many cores on the GPU.
There are multiple streaming multiprocessors (SMs), where each SM
is an 8-ALU single-instruction multiple-data (SIMD) core. A
kernel is mapped onto the device by mapping each thread block
to an SM. CUDA divides the threads within a thread block into
groups of 32 threads. When all 32 threads are executing the same
set of operations, these 32 threads, also known as a WARP,
are executed as a group on an SM over 4 cycles. Otherwise, the
threads are executed serially. There are a number of reasons for
stalls to occur. As data is not cached, an SM can stall waiting
for data. Furthermore, the floating-point pipeline is long, and
register-to-register dependencies can cause a stall in the pipeline.
To keep the cores utilized, multiple thread blocks, or concurrent
thread blocks, are mapped onto an SM and executed on the SM
at the same time. Since the GPU can switch between WARP
instructions with zero overhead, it can minimize stalls by issuing
instructions from another ready WARP while a stalled WARP waits.

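As an illustration of this latency-hiding mechanism (again our sketch, reusing the hypothetical blockSum kernel from above rather than the paper's actual configuration), the host-side launch below requests many more thread blocks than there are SMs, so that the scheduler always has ready WARPs to issue while others wait on memory:

```cuda
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *blockResults, int n);

// Illustrative launch configuration: oversubscribe the SMs so several
// concurrent thread blocks are resident on each one. When one WARP
// stalls on a global-memory load or a pipeline dependency, the
// zero-overhead scheduler issues instructions from another ready WARP.
void launchBlockSum(const float *d_in, float *d_results, int n)
{
    const int threadsPerBlock = 256;   // 8 WARPs of 32 threads per block
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    blockSum<<<numBlocks, threadsPerBlock>>>(d_in, d_results, n);
}
```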