MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com

# A Multi-Standards HDTV Video Decoder for Blu-Ray Disc Standard

Noriyuki Minegishi, Hidenori Sato, Fumitaka Izuhara, Masayuki Koyama, Anthony Vetro

TR2009-041 September 2009

# Abstract

This paper presents an HDTV video decoder core that decodes MPEG-2, MPEG-4 AVC and VC-1 formats and is fully compatible with the Blu-ray Disc standard. The core has two major features that achieve a low-cost hardware implementation for the three standards. First, a novel re-configurable architecture is adopted to reduce the hardware needed for the three different variable length coding tables. A hybrid architecture that consists of a cell array and a coefficient memory is introduced. By considering the trade-off between performance and cost, the cell array can compare a variable number of bits in a flexible manner. Second, a data compression method suitable for all video decoding standards is applied to reduce memory data usage and access bandwidth. A compression syntax based on Run-Level encoding with Exp-Golomb tables is used to achieve both low compression workload and reasonable memory cost. The core is implemented with a top-down, HDL-based approach; the circuit size is 1.5M gates in 90nm CMOS technology, and the operating clock frequency is 162MHz for 1080i at 30fr/s.

IEEE Transactions on Consumer Electronics

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2009

201 Broadway, Cambridge, Massachusetts 02139




*Index Terms*—VLSI architecture, multi-standard video decoder, dynamic reconfigurable hardware.

# I. INTRODUCTION

SEVERAL video compression standards, e.g., MPEG-2, H.264/MPEG-4 AVC and Windows Media Video (VC-1), have been established and are used in practical applications such as recent terrestrial broadcast and high-compression optical disc. Semiconductor devices that meet these standards for multimedia applications are required to achieve high performance and cost effectiveness. Several solutions have been introduced [1]-[3]; however, none of them supports high-compression optical disc standards such as Blu-ray. To develop a chip for practical use, hardware size, memory usage and memory access bandwidth must be considered. We propose a multi-standard video decoder core that adopts dynamic and static re-configurable techniques and a data compression method suitable for all video standards.

F. Izumihara and M. Koyama are with Renesas Technology Corporation, Mizuhara 4-1, Itami 664-0005, Japan (e-mail: izumihara.fumitaka@renesas.com, koyama.masayuki@renesas.com).

A. Vetro is with Mitsubishi Electric Research Labs, Cambridge, USA (e-mail: avetro@merl.com).

High compression of video streams is required for many types of consumer electronics products such as DVD, DTV, digital cameras and set-top boxes. Since most standard compression methods include transform and quantization techniques, block noise and ringing artifacts tend to appear at high compression ratios.

The rest of this paper is organized as follows. An overview of the video core architecture is described in Section II. Section III describes the proposed dynamic re-configurable variable-length coding (VLC) table. Section IV describes the data compression method and corresponding syntax. Implementation results are presented in Section V. The conclusions are presented in Section VI.

# II. OVERVIEW OF THE CORE ARCHITECTURE

This section describes the requirements and issues for video decoders and the architecture that addresses them. To satisfy the requirements of multimedia services and consumer products, the video decoder should be capable of decoding the formats of multiple standards and should achieve both high performance and low cost. To meet these demands, several design issues should be considered. For the entropy decoding function, the design must accommodate different VLC tables, and the performance should meet the bit-rate limits specified by the standard. The data bandwidth and capacity of external memory should also be minimized to realize a low-cost product. Cost effectiveness should be considered for every decoder function block to keep the chip cost low.

To solve the above issues, we propose the video decoder architecture shown in Figure 1. Considering the requirements of the Advanced profile of VC-1 and the High profile of AVC, real-time entropy decoding cannot be realized with a practical clock frequency. Therefore, the overall decode operation is divided into two parts: the VLC decode section and the pixel operation section. The VLC decode section sustains the maximum bit-rate (40Mbit/s), and the pixel operation section sustains the maximum frame size and frame rate (1080i at 30fr/s). With this partitioning, the core achieves the maximum performance of the Blu-ray specification at a clock frequency of 162MHz.
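
The cycle budget implied by these figures can be estimated with a short calculation. This is only a back-of-the-envelope sketch: the 16x16 macroblock size and the coded frame height of 1088 lines are assumptions that are not stated in the text, and all names are illustrative.

```python
# Rough cycle budget implied by the figures quoted above.
# Assumptions (not from the paper): 16x16 macroblocks, 1920x1088 coded frame.
CLOCK_HZ = 162_000_000        # operating clock frequency
MAX_BIT_RATE = 40_000_000     # maximum bit-rate in bit/s
FRAME_RATE = 30               # frames per second
WIDTH, HEIGHT = 1920, 1088    # assumed coded frame size
MB_SIZE = 16                  # assumed macroblock edge length

mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
mbs_per_second = mbs_per_frame * FRAME_RATE

# Average clock cycles available to the pixel operation section per macroblock.
cycles_per_mb = CLOCK_HZ / mbs_per_second

# Average clock cycles available to the VLC decode section per input bit.
cycles_per_bit = CLOCK_HZ / MAX_BIT_RATE

print(f"{mbs_per_frame} macroblocks/frame")
print(f"{cycles_per_mb:.1f} cycles/macroblock, {cycles_per_bit:.2f} cycles/bit")
```

Under these assumptions, each section has its own average budget (cycles per bit for VLC decoding, cycles per macroblock for pixel operations), which is one way to read the motivation for separating the two.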

A hybrid architecture is adopted for VLC decoding to realize both flexibility and high performance. For VLC decoding, a dynamic re-configurable VLC table is introduced to minimize the hardware needed for the quite different VLC tables specified by each video standard. Moreover, a data compression method based on Exp-Golomb codes is applied and implemented in the data buffer blocks. To prevent an empty buffer, the VLC decoding must be performed fast enough. The data compression function reduces external memory usage and the access bandwidth between the core and external memory to satisfy this requirement.

N. Minegishi and H. Sato are with Mitsubishi Electric Corporation, Information Technology R&D Center, Ofuna 5-1-1, Kamakura 247-8501, Japan (e-mail: Minegishi.Noriyuki@aj.MitsubishiElectric.co.jp, Shimada.Toshiaki@ap.MitsubishiElectric.co.jp).



Figure 1. Overview of the proposed video decoding core.

# III. A DYNAMIC RE-CONFIGURABLE VLC TABLE

A key design issue for entropy decoding in our architecture is to realize different VLC tables with a low-cost, high-performance implementation. Therefore, we adopted dynamic re-configurable hardware. Several types of dynamic re-configurable hardware have been introduced [4]-[10]; however, none of them is suitable for the entropy decoding function. In our proposed scheme, a VLC table is built from a comparison cell array in which the information held by each cell can be dynamically reconfigured. Two key aspects of the dynamic re-configurable VLC table hardware should be accounted for: (i) determining the number of cells, and (ii) storing information such as comparison data, coefficients and routing information.

# A. Dynamic re-configurable VLC Table Architecture

Figure 2 shows a block diagram of the dynamic re-configurable VLC table corresponding to the video standards of interest. The table consists of an array of matching elements (PEs) that compare the input bit stream with a stored binary pattern, an address decoder, a memory that contains table configuration data and decoding results, and control logic. The bit-stream data is routed in common to every cell. This avoids wiring congestion, so the routing area is smaller. Each designated cell compares the bit stream with immediate data; no register is needed to hold the comparison data, which reduces cost. The comparison bit width is set from 1 bit to 4 bits. This grouping gives a good balance between low cost and high performance.

Each cell outputs its own number if the input value is matched. If not, the cell outputs a "0" value. In this way, the "matched PE number" doesn't need a selector; it simply consists of an "OR" tree. This architecture also helps to reduce cost. Mapping information and coefficients are contained in the memory. This design provides both flexibility and low cost.
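
The selector-free "OR" tree can be illustrated in software. The following is a minimal sketch, not the actual circuit: the cell patterns are hypothetical, cell numbers start at 1 so that 0 can stand for "no match" (an assumption of this sketch), and the patterns are prefix-free so that at most one cell matches a given input.

```python
from functools import reduce
from operator import or_

WINDOW_BITS = 4  # widest comparison, following the 1-bit to 4-bit range in the text

# Hypothetical cell configuration: (cell_id, pattern, width in bits).
# The patterns form a prefix-free set, so at most one cell can match.
CELLS = [
    (1, 0b1, 1),
    (2, 0b01, 2),
    (3, 0b001, 3),
    (4, 0b0001, 4),
    (5, 0b0000, 4),
]

def cell_output(cell, window):
    """Return the cell's own number on a match, otherwise 0.

    `window` holds the next WINDOW_BITS bits of the stream, MSB first.
    """
    cell_id, pattern, width = cell
    leading_bits = window >> (WINDOW_BITS - width)
    return cell_id if leading_bits == pattern else 0

def matched_pe(window, cells=CELLS):
    """OR all cell outputs together; no selector (multiplexer) is needed."""
    return reduce(or_, (cell_output(cell, window) for cell in cells), 0)

print(matched_pe(0b0110))  # the 2-bit pattern 01 matches, so cell 2 is reported
```

Because every non-matching cell contributes 0, the bitwise OR of all outputs equals the number of the single matching cell, which is why a plain OR tree is enough.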



Figure 2. Dynamic re-configurable VLC Table block diagram.

# B. Decoding Example

Figure 3 shows an example of how a comparison is mapped onto the PE array. An entropy coding table can be regarded as a tree search structure. Shorter codes are assigned to higher-probability symbols and are placed in the upper nodes of the tree. According to our MPEG-2 video sequence simulations, about 40% of the codes are covered within 4 bits. Hence, 4-bit comparisons are chosen as a fair trade-off between hardware and performance. The variable length decoding process is described below.

At the beginning, the PE groups identified as "R0" and "R2" in Figure 2 are activated, and PE0, PE1 and PE6 to PE13 are selected to compare nodes "n0" to "n4", as shown in Figure 3. If PE9, which is assigned to the branch node "n3", is matched, the information read from the on-chip memory changes. The table hardware is thereby dynamically re-configured and continues the search in the second row of the VLC table tree.

For the second-row comparison, the control logic disables "R0" and "R2" and activates "R1" and "R4", as shown in Figure 2. Then PE2 to PE5 and PE22 to PE29 in Figure 3 are selected, and nodes "n5" to "n13" are compared. If PE2, which is a terminal node of the VLC table, is matched, the on-chip memory outputs a coefficient value. The VLC table hardware then returns to its initial state and begins to search for the next code.
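
The two-row search described above can be mimicked with a small table-driven decoder. This is a sketch under stated assumptions: the table contents, stage names and symbols are hypothetical, and each stage compares patterns of at most 4 bits. A branch entry plays the role of the re-configuration step, and a leaf entry plays the role of the terminal node that returns the table to its initial state.

```python
# Hypothetical two-stage VLC table. Each stage maps a bit pattern (1 to 4 bits)
# to ("leaf", symbol) or ("branch", next_stage). Patterns in a stage are prefix-free.
TABLE = {
    "row1": {
        "1": ("leaf", "A"),
        "01": ("leaf", "B"),
        "001": ("leaf", "C"),
        "0001": ("leaf", "D"),
        "0000": ("branch", "row2"),  # branch node: load the next row's configuration
    },
    "row2": {
        "1": ("leaf", "E"),
        "01": ("leaf", "F"),
        "00": ("leaf", "G"),
    },
}

def decode(bits, table=TABLE, start="row1"):
    """Decode a string of '0'/'1' characters into a list of symbols."""
    symbols = []
    pos = 0
    stage = start
    while pos < len(bits):
        for pattern, (kind, target) in table[stage].items():
            if bits.startswith(pattern, pos):
                pos += len(pattern)
                if kind == "leaf":
                    symbols.append(target)
                    stage = start      # terminal node: return to the initial state
                else:
                    stage = target     # branch node: re-configure for the next row
                break
        else:
            raise ValueError(f"no pattern matches at bit position {pos}")
    if stage != start:
        raise ValueError("bit stream ends in the middle of a code word")
    return symbols

print(decode("1" + "01" + "00001" + "000000"))
```

In this toy table, any code of up to 4 bits resolves in a single comparison step, while longer codes take one extra step per row, which mirrors the trade-off behind choosing 4-bit comparisons.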



Figure 3. VLC Table mapping and decode example.

# IV. AN INTERMEDIATE DATA COMPRESSION METHOD

To meet the required performance of the Advanced profile of VC-1 and the High profile of AVC with a practical clock frequency, the decode operation is divided into the VLC decode section and the pixel operation section. This architecture needs to store intermediate data, which has a high data volume. Therefore, a data compression method is introduced to minimize memory usage.

Regarding the data compression method, two key aspects should be considered. One is the workload of the compression and the other is the memory cost. The proposed approach attempts to provide an optimal balance between the two.

# A. Data Compression Method Syntax

Table 1 shows a sample compression syntax for CABAC coefficients. The run-level and Exponential-Golomb methods are applied to minimize workload.

Table 1. Compression syntax for CABAC coefficients.

| Syntax                  | Bits | Coding                                    |
|-------------------------|------|-------------------------------------------|
| block_CABAC(){          |      |                                           |
| while(LEVEL!=EOB){      |      |                                           |
| (first RUN)= numCoeff-1 |      |                                           |
| RUN                     | 1-13 | Unsigned ExpGolomb                        |
| LEVEL                   | 1-31 | Signed ExpGolomb + FLC(cmax=14,FLC=16)    |
| }                       |      |                                           |
| }                       |      |                                           |

Considering compression rate and encode-decode performance, the Exp-Golomb algorithm is applied. However, Exp-Golomb coding is not efficient for large values. Hence, we set 14 bits as the length limit for LEVEL data compression; this limit was determined empirically.

The Exp-Golomb table with fixed length code is shown in Table 2. The value "1" is a marker, and the number of "0" bits preceding it gives the number of bits to be decoded after it. This compression method is applied to transform coefficient and motion vector data. Since a coefficient value must be treated as a signed value, the signed Exp-Golomb code is adopted.

Table 2. Exp-Golomb table with fixed length code.

| Bitstring form                        | Range of codeNum |
|---------------------------------------|------------------|
| $1$                                   | 0                |
| $0\ 1\ X_0$                           | 1-2              |
| $0\ 0\ 1\ X_0\ X_1$                   | 3-6              |
| $0\ 0\ 0\ 1\ X_0\ X_1\ X_2$           | 7-14             |
|                                       |                  |
| 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |                  |
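
The regular rows of the table follow the usual Exp-Golomb construction, which can be sketched as follows. This is an illustrative sketch only: the signed mapping shown is the one used for signed Exp-Golomb codes in H.264 and is assumed here, and the fixed-length escape for large LEVEL values is not modelled because its exact bit layout is not given in the text. All function names are hypothetical.

```python
def ue_encode(code_num):
    """Unsigned Exp-Golomb: n zeros, a marker '1', then n suffix bits."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    value = code_num + 1
    zeros = value.bit_length() - 1
    # format(value, "b") starts with the marker '1' followed by the suffix bits.
    return "0" * zeros + format(value, "b")

def ue_decode(bits, pos=0):
    """Decode one code word starting at `pos`; return (code_num, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end

def signed_to_code_num(value):
    """Assumed H.264-style mapping: 0, 1, -1, 2, -2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * value - 1 if value > 0 else -2 * value

def code_num_to_signed(code_num):
    """Inverse of signed_to_code_num."""
    return (code_num + 1) // 2 if code_num % 2 else -(code_num // 2)

for n in (0, 1, 2, 3, 6, 7, 14):
    print(n, ue_encode(n))
```

Each additional leading zero doubles the covered range and adds two bits to the code word, which is why the code is compact for small values but grows long for large ones.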

# B. Run Count Example

Figure 4 shows a run-count example in which the proposed method is applied. In many cases, non-zero coefficients appear early in the scan order. Hence, to indicate the position of the last non-zero coefficient, a forward scan is initially applied. This method typically gives a smaller first run value compared to scanning backward. However, for the rest of the coefficients, a backward scan is used.



Figure 4. Run count example.

In the example, the first RUN is "7" because seven coefficients precede the last non-zero coefficient in forward scan order, and its LEVEL is "1". The zero coefficients are then counted backward: the next RUN is "2" and the next LEVEL is "5". The syntax continues in this way until the first coefficient is reached.
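
The run counting can be sketched as follows. The coefficient values are hypothetical and chosen only so that the first two pairs match the RUN and LEVEL values quoted above; the EOB marker of Table 1 and the subsequent Exp-Golomb coding of each value are left out of this sketch.

```python
def run_level_pairs(coeffs):
    """Return (RUN, LEVEL) pairs for a block given in scan order.

    The first RUN counts the coefficients before the last non-zero one
    (forward scan). Every later RUN counts the zeros skipped while
    scanning backward to the next non-zero coefficient.
    """
    nonzero = [i for i, c in enumerate(coeffs) if c != 0]
    if not nonzero:
        return []
    last = nonzero[-1]
    pairs = [(last, coeffs[last])]
    prev = last
    for i in reversed(nonzero[:-1]):
        pairs.append((prev - i - 1, coeffs[i]))
        prev = i
    return pairs

def rebuild(pairs, size):
    """Inverse operation: restore the coefficient list from the pairs."""
    coeffs = [0] * size
    if not pairs:
        return coeffs
    pos, level = pairs[0]
    coeffs[pos] = level
    for run, level in pairs[1:]:
        pos -= run + 1
        coeffs[pos] = level
    return coeffs

# Hypothetical 16-coefficient block in scan order.
block = [3, 0, 0, 0, 5, 0, 0, 1] + [0] * 8
print(run_level_pairs(block))
```

With this block, the trailing zeros after the last non-zero coefficient never need to be counted, which is the benefit of taking the first run from the forward direction.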

# V. IMPLEMENTATION AND RESULT

The core is implemented with a top-down, HDL-based approach. We carried out HDL synthesis with a 90nm CMOS ASIC library. The circuit size of the core is 1.5M gates and the maximum operating clock frequency is 162MHz.

With the proposed design, a Blu-ray Disc video decoder that supports full HDTV resolution and bit-rates up to 40Mbps is realized. By applying our dynamic re-configurable VLC table, the circuit size of the table is reduced by 60% compared with a conventional hard-wired logic implementation. We evaluated the proposed data compression method with over 300 video sequences. Memory data usage is reduced by 50% and access bandwidth is improved by 12%. As a result, two 512Mbit DDR2 SDRAM devices operating at 324MHz can be used.

# VI. CONCLUSION

A multi-standards video decoder for the Blu-ray Disc standard was introduced. The decoder conforms to the Blu-ray Disc standard, which requires a maximum bit-rate of 40Mbps and 1080i resolution at 30fr/s. The supported video standards include MPEG-2 Main profile at High level, H.264 High profile at Level 4.1, and VC-1 Advanced profile at Level 3. The decoder realizes a low-cost LSI implementation. The gate count is 1.5M gates and the operating clock frequency is 162MHz for all video standards at HDTV resolutions. A novel circuit methodology for dynamic re-configurable VLC tables is introduced, which has been shown to reduce circuit size by 60% compared with a conventional hard-wired logic implementation at a bit rate of 40Mbps for entropy decoding. An original data compression method is also applied to realize both low cost and real-time performance. The proposed method uses a RUN-LEVEL syntax and a signed Exp-Golomb code table to reduce data usage by 50% and improve access bandwidth by 12% with negligible workload. Table 3 provides a summary of this work.

Table 3. Multi-standards HDTV Video Decoder Summary.

| Corresponding Standards | Blu-ray disc standard |
|-------------------------|-----------------------|
| Max. bit-rate           | 40Mbps                |
| Max. resolution, rate   | 1080i, 30fr/s         |
| Video Standards         | MPEG-2: MP@HL         |
|                         | H.264 : HP@L4.1       |
|                         | VC-1 : AP@L3          |
| Number of gates         | 1.5 Million Gates     |
| Max. Clock Frequency    | 162MHz                |

# REFERENCES

- [1] T-M. Liu, et al, "A 125uW Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications," *IEEE Int'l Solid-State Circuit Conf.*, Feb. 2006.
- [2] C-D. Chien, et al, "A 252kgate/71mW Multi-Standard Multi-Channel Video Decoder for High Definition Video Applications," *IEEE Int'l Solid-State Circuit Conf.*, Feb. 2007.
- [3] Y-S. Tung, et al, "DSP-Based Multi-Format Video Decoding Engine for Media Adapter Applications," ICCE 2005.
- [4] H. Ito, R. Konishi, H. Nakada, H. Tsuboi, and A. Nagoya, "Dynamically Reconfigurable Logic LSI designed as Fully Asynchronous System – PCA-2", Proc. Notebook for COOL Chips VI, Apr. 2003.
- [5] T. Sato, "Dynamically Reconfigurable Processor DAP/DNA-2 and Development DAP/DNA-FW", Proc. Notebook for COOL Chips VII, Apr. 2004.
- [6] K. Furuta, T. Fujii, M. Motomura, K. Wakabayashi, and M. Yamashina, "Spatial-temporal mapping of real applications on a dynamically reconfigurable logic engine (DRLE) LSI", Proc. of IEEE Custom Integrated Circuits Conference, May 2000.

- [7] C. Pretty, and J.G. Chase, "Reconfigurable DSP's for Efficient MPEG-4 Video and Audio Decoding", Proc. of IEEE International Workshop on Electronic Design, Test and Applications, Jan. 2002.
- [8] J. Becker, A. Thomas, and M. Scheer, "Efficient Processor Instruction Set Extension by Asynchronous Reconfigurable Datapath Integration", Proc. of IEEE Symposium on Integrated Circuits and Systems Design, May 2003.
- [9] M. Okada, T. Hiramatsu, H. Nakajima, M. Ozone, K. Hirase, and S. Kimura, "A Reconfigurable Processor based on ALU array architecture with limitation on the interconnection", Proc. of IEEE International Parallel and Distributed Processing Symposium, Apr. 2005.
- [10] Hsiu-Cheng Chang, Chien-Chang Lin and Jiun-In Guo, "A Novel Low-Cost High-Performance VLSI Architecture for MPEG-4 AVC/H.264 CAVLC Decoding", Proc. of IEEE International Symposium on Circuit and Systems, May 2005.



**Noriyuki Minegishi** graduated from Toin Technical College, Yokohama, Japan, in 1985. He received a Ph.D. in Electrical Engineering and Computer Science from Kanazawa University in 2005. He joined Mitsubishi Electric Corporation in 1985. From 1985 to 1999, he was engaged in architecture design, circuit design, and verification for chips, including design for test and chip development methodologies, for communication, video processing, error correcting, encryption, and industrial computers. Since 1999, he has been involved in research and development of multimedia processing SOCs, especially for video processing. He is a member of the Information Technology R&D Center, Kamakura, Japan. Since 2005, he has been a member of the Technical Committee on Circuits and Systems, IEICE.



**Fumitaka Izuhara** received the B.S. and M.S. degrees in Computer Science and Electronics from Kyushu Institute of Technology, Fukuoka, Japan, in 1995 and 1997, respectively. He joined the Hitachi ULSI systems co., Tokyo, Japan, in 1997. In 2003, he transferred to the Renesas Technology Corp., Tokyo, Japan. He is now an engineer in SystemDesignDept.3, SystemDesign Div, Renesas Technology Corporation.



**Masayuki Koyama** was born in Okayama, Japan, in 1963. He received the B.E. degree in electrical engineering from Science University of Tokyo, Japan, in 1987. In 1987, he joined Matsushita Electric Corporation, Moriguchi, Japan, where he was engaged in the research and development of MPU design. In 1990, he joined Mitsubishi Electric Corporation, Japan. He is currently with the MCU Design div., Renesas Technology Corporation, Japan. Since 1987 he has been engaged in the research and development of digital signal processing VLSI design.



**Anthony Vetro** (S'92-M'96-SM'04) received the B.S., M.S. and Ph.D. degrees in Electrical Engineering from Polytechnic University, Brooklyn, NY. He joined Mitsubishi Electric Research Labs, Cambridge, MA, in 1996, where he is currently a Group Manager responsible for multimedia technology. He has published more than 130 papers and has been an active member of the MPEG and JVT standardization. Dr. Vetro serves on the program committee for various conferences and has held several editorial positions. He is currently Chair of the Technical Committee on Multimedia Signal Processing of the IEEE Signal Processing Society. He served as Conference Chair for ICCE 2006 and as a member of the Publications Committee of the IEEE TRANSACTIONS ON CONSUMER ELECTRONICS. Dr. Vetro has also received several awards for his work on transcoding, including the 2003 IEEE Circuits and Systems CSVT Transactions Best Paper Award and the 2002 Chester Sall Award. He is a Senior Member of the IEEE.