Implementation of a 3GPP LTE Turbo Decoder
Accelerator on GPU
Michael Wu, Yang Sun, and Joseph R. Cavallaro
Electrical and Computer Engineering
Rice University, Houston, Texas 77005
{mbw2, ysun, cavallar}@rice.edu
Abstract—This paper presents a 3GPP LTE compliant turbo
decoder accelerator on GPU. The challenge of implementing a
turbo decoder on GPU is finding an efficient mapping of the
decoding algorithm, e.g., a good way to parallelize the workload
across cores and to allocate and use fast on-die memory to improve
throughput. In our implementation, we increase throughput by
1) distributing the decoding workload for a codeword across
multiple cores, 2) decoding multiple codewords simultaneously
to increase concurrency, and 3) employing memory optimization
techniques to reduce memory bandwidth requirements. In addition,
we analyze how different MAP algorithm approximations affect
both throughput and bit error rate (BER) performance of this
decoder.
I. Introduction
Turbo codes [1] have become one of the most important
research topics in coding theory for wireless communication
systems. As a practical code that approaches channel capacity,
turbo codes are widely used in many 3G and 4G wireless
standards such as CDMA2000, WCDMA/UMTS, IEEE 802.16e
WiMAX, and 3GPP LTE (Long Term Evolution). However, this low
BER comes at a price: the inherently large decoding latency
and the complex iterative decoding algorithm have made it very
difficult to achieve high throughput on general-purpose processors
or digital signal processors. As a result, turbo decoders are often
implemented in ASIC or FPGA [2-8].
The Graphics Processing Unit (GPU) is another alternative,
as it provides high computational power while maintaining
flexibility. GPUs deliver extremely high computation throughput
by employing many cores running in parallel. Similar to general
purpose processors, GPUs are flexible enough to handle general
purpose computations. In fact, a number of processing-intensive
communication algorithms have been implemented on GPU.
GPU implementations of LDPC decoders are capable of real-time
throughput [9]. In addition, both a hard-decision MIMO
detector [10] and a soft-decision MIMO detector [11]
have been implemented on GPU.
In this paper, we aim to provide an alternative: a turbo
decoder defined entirely in software on the GPU that reduces the
design cycle and delivers good throughput. In particular, we
partition the decoding workload across cores and pre-fetch data
to reduce memory stalls. However, parallelization of the decoding
algorithm can improve the throughput of a decoder at the expense
of its BER performance. We therefore also report both
the throughput and BER performance of the decoder and show
that we can parallelize the workload on GPU while maintaining
reasonable BER performance. Although ASIC and FPGA designs
are more power-efficient and can offer higher throughput than our
GPU design [12], this work will allow us to accelerate simulation
as well as to implement a complete iterative MIMO receiver in
software on a wireless test-bed platform such as WARPLAB [13].
The rest of the paper is organized as follows: in Sections II
and III, we give an overview of the CUDA architecture and the
turbo decoding algorithm. In Section IV, we discuss the
implementation on the GPU. Finally, we present BER performance
and throughput results and analysis in Section V and conclude
in Section VI.
II. Compute Unified Device Architecture (CUDA)
Compute Unified Device Architecture [14] is a software program-
ming model that allows the programmer to harness the massive
computation potential offered by the programmable GPU. The
programming model is explicitly parallel: the programmer specifies
the parallelism, i.e. how operations are applied to a set of data,
in a kernel. At runtime, multiple threads are spawned, where each
thread runs the operations defined by the kernel on a data set.
In this programming model, threads are independent by default.
However, threads within a block can share computation through
barrier synchronization and shared memory. Thread blocks are
completely independent and can only be synchronized by writing
to global memory and terminating the kernel.
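As an illustrative sketch of this model (not part of our decoder; the kernel name, block size, and data layout are hypothetical), the following kernel shows each thread operating on its own element while threads within a block cooperate through shared memory and barrier synchronization:

// Minimal CUDA sketch (illustrative only): each thread scales one element,
// then threads in a block combine results via shared memory and barriers.
#define BLOCK_SIZE 128

__global__ void scale_and_sum(const float *in, float *block_sums,
                              float scale, int n)
{
    __shared__ float partial[BLOCK_SIZE];        // fast on-die shared memory
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;     // each thread owns one element

    partial[tid] = (idx < n) ? scale * in[idx] : 0.0f;  // per-thread work
    __syncthreads();                             // barrier within the thread block

    // Tree reduction inside the block; blocks remain independent and can only
    // exchange results through global memory across kernel launches.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = partial[0];
}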
Compared to traditional general-purpose processors, a program-
mable GPU has much higher peak computation throughput. This
computation power is enabled by the many cores on the GPU.
There are multiple streaming multiprocessors (SMs), where each
SM is an 8-ALU single-instruction multiple-data (SIMD) core. A
kernel is mapped onto the device by mapping each thread block
to an SM. CUDA divides the threads within a thread block into
groups of 32 threads. When all 32 threads are performing the same
set of operations, these 32 threads, also known as a WARP, are
executed as a group on an SM over 4 cycles. Otherwise, the
threads are executed serially. There are a number of reasons for
stalls to occur. As data is not cached, an SM can stall waiting
for data. Furthermore, the floating-point pipeline is long, and
register-to-register dependencies can cause stalls in the pipeline.
To keep the cores utilized, multiple thread blocks, or concurrent
thread blocks, are mapped onto an SM and executed at the same
time. Since the GPU can switch between WARP instructions with
zero overhead, it can minimize stalls by switching to another
ready WARP whenever the current WARP stalls.
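To make this latency-hiding mechanism concrete, the host-side launch below is a sketch with hypothetical sizes and buffer names; it assumes the scale_and_sum kernel from the previous listing and device buffers already allocated with cudaMalloc. The block size is a multiple of the 32-thread WARP, and the grid contains many independent blocks so the SM scheduler always has other ready WARPs to swap in when one stalls:

// Illustrative host-side launch (hypothetical values and buffer names).
#include <cuda_runtime.h>

void launch_scale_and_sum(const float *d_in, float *d_block_sums, int n)
{
    dim3 block(128);                             // 4 WARPs of 32 threads per block
    dim3 grid((n + block.x - 1) / block.x);      // many independent thread blocks
    scale_and_sum<<<grid, block>>>(d_in, d_block_sums, 0.5f, n);
    cudaDeviceSynchronize();                     // wait for the kernel to complete
}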