TSA in Information Sciences---Chinese Academy Of Sciences

YANG Xuejun

Xuejun Yang, male, is a professor and Ph.D. supervisor at National University of Defense Technology, an academician of the Chinese Academy of Sciences, and currently President of National University of Defense Technology. He was born in Wucheng, Shandong Province, in April 1963. He received his Bachelor of Engineering degree from University of Nanjing Communication Engineering in 1983 and his Ph.D. in engineering from National University of Defense Technology in 1990. He was promoted to Professor in 1995.

Professor Yang has devoted himself to research on high-performance computer systems and system software. As chief designer, he has led the design and manufacture of five high-performance computer systems, as well as “TH-1”, which was ranked No. 1 in the Top500 list published in Nov. 2010. He proposed the CPU-Stream Processor heterogeneous architecture for high-performance computing and efficiency optimization. He also has world-class achievements in scalable shared-memory architecture and high-accuracy hyper-64-bit floating-point computing. In addition, he has made breakthroughs in the fast deployment of high-performance parallel computers on the battlefield and in system reliability technology. Professor Yang received the National Science Fund for Distinguished Young Scholars and the NSFC Fund for Innovative Research Groups. He has published more than 90 papers as first or second author in top journals and conferences, such as IEEE Transactions, ACM Transactions, and ISCA, of which 25 are indexed by SCI and 62 by EI. His awards include the first prize of the National Award for Science and Technology Progress (three times), the second prize of the National Award for Technological Innovation (once), and the Ho Leung Ho Lee Science and Technology Achievement Award.

Research on Scalable Parallel Computing Technology

Abstract

It is difficult to build a petascale supercomputer using only general-purpose CPUs, because of the great challenges posed by the scalability of the control system, power consumption, and the interconnect. Breakthroughs in parallel architecture and interconnect techniques are therefore required. The stream processing architecture currently achieves the highest computing capability per unit of chip area, but it had been adopted only in streaming media processing. The project Research on Scalable Parallel Computing Technology introduces, for the first time, stream processing to large-scale scientific and engineering computing, and systematically answers the questions of whether a scientific computing program can be streamized and whether it can be streamized efficiently.

The project Research on Scalable Parallel Computing Technology aims to meet the requirements of the Nation and the Army for large-scale scientific and engineering computing. It answers the above questions in terms of theory, technique, and engineering, covering hardware microprocessors, software optimization, and computer systems. The project introduces high-performance, low-power stream processors into parallel systems, designs the first CPU-GPU heterogeneous parallel architecture in the world, and keeps a good balance between computing speed and memory access efficiency. It designed and fabricated the first 64-bit stream processor as an accelerator for typical scientific applications. In 2007, a product system with 128 stream processors was deployed, which verified the feasibility and practicality of combining CPUs with stream processors for large-scale scientific and engineering computing. Around the novel heterogeneous parallel architecture, the project proposed many approaches to performance optimization, including six conditions for deciding whether a loop is streamizable, a loop streamization algorithm, a stream data reuse model, and a comparability graph coloring algorithm that finds optimal or near-optimal colorings for stream register file allocation. These techniques raised the efficiency of the heterogeneous system to as much as 70.1% on the LINPACK benchmark, a world-leading result. The project made breakthroughs in highly scalable interconnection networks, designing an effective communication protocol, a tile-based on-chip interconnect network, and a high-density inter-chip interconnection network. It also designed a high-radix routing chip and a high-speed network interface chip.
The above techniques make the bidirectional bandwidth of interconnection links twice that of IB QDR, the mainstream commercial interconnection network product in the world, successfully solving the problem of effectively interconnecting more than 100,000 nodes. The project also made breakthroughs in the design of multicore multithreaded architecture and on-chip parallel systems, and developed a general-purpose CPU, FT-1000, which has 8 cores and 64 threads and achieves a peak performance of 8 GFLOPS. All of these achievements provide a solid scientific and technological basis for the development of a series of supercomputers, including YH-Ⅲ, YH-X, YH-Y, YH-Z, YH-TENGYUE and “TH-1”, and enabled the petascale system “TH-1A” to rank No. 1 in the Top500 list published in Nov. 2010. Bill Dally, a Fellow of the American Academy of Arts and Sciences and a member of the U.S. National Academy of Engineering, considers the GPU to be one of the most promising ways to achieve exascale computing.
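
The abstract mentions a comparability graph coloring algorithm for stream register file allocation but does not describe it. As a rough illustration only, and not the project's actual algorithm, the sketch below uses a textbook property of comparability graphs: once a transitive orientation is known, giving each vertex the length of the longest directed path ending at it yields a coloring with the minimum number of colors. The function name, the input format, and the reading of vertices as streams and colors as register-file regions are all assumptions made for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def color_comparability_graph(successors):
    """Color a comparability graph from a transitive orientation.

    `successors` (hypothetical input format) maps each vertex to the
    vertices it points to. The orientation is assumed to be acyclic and
    transitive; finding such an orientation is not covered here.
    """
    # Collect every vertex, including those that only appear as targets.
    vertices = set(successors)
    for targets in successors.values():
        vertices.update(targets)

    # graphlib expects a mapping from each node to its predecessors.
    predecessors = {v: set() for v in vertices}
    for u, targets in successors.items():
        for v in targets:
            predecessors[v].add(u)

    # Visit vertices in topological order; a vertex's color is one more
    # than the largest color among its predecessors (0 if it has none).
    color = {}
    for v in TopologicalSorter(predecessors).static_order():
        color[v] = 1 + max((color[u] for u in predecessors[v]), default=-1)
    return color


# Hypothetical example: four streams; an edge means the two streams
# conflict and need different register-file regions.
orientation = {"a": ["b", "c"], "b": ["c"], "d": ["c"]}
print(sorted(color_comparability_graph(orientation).items()))
# [('a', 0), ('b', 1), ('c', 2), ('d', 0)]
```

In this example the streams a, b and c conflict pairwise, so three colors are unavoidable, while d conflicts only with c and can share a color with a.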