Registration Now Open | AI Chip Architecture and Software Seminar 2020

Seminar time:

Sunday, April 12, 2020

9:00-12:00 (morning)

Official seminar homepage:

https://event.baai.ac.cn/con/architecture-and-software-design-for-ai-chip-2020/

(copy the URL into your browser to view)

陈云霁 (Yunji Chen)

Professor, Institute of Computing Technology, Chinese Academy of Sciences

Chief Scientist, Beijing Academy of Artificial Intelligence (BAAI)

Seminar Chair

梁云 (Yun Liang)

Research Professor, Peking University; BAAI Young Scientist

高鸣宇 (Mingyu Gao)

Assistant Professor, Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University

Assistant Professor and Ph.D. advisor at the Institute for Interdisciplinary Information Sciences, Tsinghua University. He received his Ph.D. and M.S. in Electrical Engineering from Stanford University, and his B.S. in Microelectronics from Tsinghua University. His research covers computer systems and architecture, big-data system optimization, and hardware system security. His major contributions include hardware/software near-data-processing systems for big-data applications, high-density low-power reconfigurable computing chips, and scheduling optimization for specialized neural network chips. He has published at top international architecture conferences (ISCA, ASPLOS, HPCA, PACT, etc.) and holds multiple granted patents. His honors include selection as an IEEE Micro Top Pick in computer architecture for 2016, the European HiPEAC Paper Award, and Forbes China's 2019 "30 Under 30" list.

侯锐 (Rui Hou)

Professor and Ph.D. advisor, Institute of Information Engineering, Chinese Academy of Sciences

Deputy Director, State Key Laboratory of Information Security

Program

09:00-09:05

Opening remarks – 陈云霁, Professor, Institute of Computing Technology, CAS, and BAAI Chief Scientist

09:05-09:35

梁云, Research Professor, Peking University; BAAI Young Scientist

Talk title:

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Abstract:

Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost has led to high demand for flexible, portable, and high-performance library implementation on heterogeneous hardware accelerators such as GPUs and FPGAs. However, the current tensor library implementation mainly requires programmers to manually design low-level implementation and optimize from the algorithm, architecture, and compilation perspectives. Such a manual development process often takes months or even years, which falls far behind the rapid evolution of the application algorithms.

In this paper, we introduce FlexTensor, which is a schedule exploration and optimization framework for tensor computation on heterogeneous systems. FlexTensor can optimize tensor computation programs without human interference, allowing programmers to work only on high-level programming abstraction without considering the hardware platform details. FlexTensor systematically explores the optimization design spaces that are composed of many different schedules for different hardware. Then, FlexTensor combines different exploration techniques, including heuristic and machine learning methods, to find the optimized schedule configuration. Finally, based on the results of exploration, customized schedules are automatically generated for different hardware. In the experiments, we test 12 different kinds of tensor computations with hundreds of test cases in total, and FlexTensor achieves an average 1.83x performance speedup on an NVIDIA V100 GPU compared to cuDNN; a 1.72x performance speedup on an Intel Xeon CPU compared to MKL-DNN for 2D convolution; a 1.5x performance speedup on a Xilinx VU9P FPGA compared to OpenCL baselines; and a 2.21x speedup on an NVIDIA V100 GPU compared to the state-of-the-art.
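For readers unfamiliar with schedule-based optimization, the core idea FlexTensor automates is searching over schedule parameters (tile sizes, loop orders, and so on) rather than hand-tuning them; the paper's implementation builds on a compiler stack, but the explore-then-pick flow can be illustrated at toy scale. The Python sketch below is only such an illustration, with a made-up one-dimensional search space (the tile size of a blocked matrix multiplication); it does not reflect FlexTensor's actual API.

```python
import time
import numpy as np

def blocked_matmul(A, B, tile):
    """One candidate schedule: compute A @ B with tile x tile loop blocking."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

def explore_schedules(n=512, tile_candidates=(16, 32, 64, 128, 256)):
    """Walk a tiny one-dimensional schedule space, time each candidate,
    and keep the fastest; the explore-then-pick flow in miniature."""
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    best_tile, best_time = None, float("inf")
    for tile in tile_candidates:
        start = time.perf_counter()
        blocked_matmul(A, B, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile, best_time

if __name__ == "__main__":
    tile, secs = explore_schedules()
    print(f"best tile size: {tile} ({secs:.4f}s)")
```

Real systems replace the exhaustive loop with heuristic and learned search, since realistic schedule spaces (multi-level tiling, reordering, vectorization, fusion) are far too large to enumerate.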

09:40-10:10

骆沁毅 (Qinyi Luo), Research Assistant, University of Southern California

Talk title:

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

Abstract:

Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data-parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker among all workers, so it is significantly slower in heterogeneous settings. AD-PSGD, a newly proposed synchronization method which provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds: designing a distributed training method that has both the high performance of All-Reduce in homogeneous environments and the good heterogeneity tolerance of AD-PSGD?

In this paper, we propose Prague, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve the above goal with intensive synchronization optimization by exploring the interplay between algorithm and system implementation, or statistical and hardware efficiency. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that enables fast synchronization among a group of workers. To reduce serialization cost, we propose static group scheduling in homogeneous environments and simple techniques, i.e., Group Buffer and Group Division, to largely eliminate conflicts with slightly reduced randomness. Our experiments show that in a homogeneous environment, Prague is 1.2x faster than the state-of-the-art implementation of All-Reduce, 5.3x faster than Parameter Server and 3.7x faster than AD-PSGD. In a heterogeneous setting, Prague tolerates slowdowns well and achieves 4.4x speedup over All-Reduce.
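The heart of the approach is the Partial All-Reduce primitive: a small, dynamically formed group of workers synchronizes among themselves, so one straggler cannot block the whole cluster. The NumPy simulation below is a hypothetical toy model of one such step; the function names and the random group-selection policy are assumptions for illustration, not Prague's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def all_reduce(models):
    """Global All-Reduce: every worker ends with the global average,
    so one straggler delays every step."""
    avg = models.mean(axis=0)
    return np.broadcast_to(avg, models.shape).copy()

def partial_all_reduce(models, group):
    """Partial All-Reduce: only the workers in `group` synchronize;
    a slow worker outside the group cannot block this step."""
    out = models.copy()
    out[group] = models[group].mean(axis=0)
    return out

# Eight workers, each holding a four-parameter model replica.
models = rng.normal(size=(8, 4))

# One asynchronous step: a randomly formed group of three workers syncs.
group = rng.choice(8, size=3, replace=False)
after_partial = partial_all_reduce(models, group)
after_global = all_reduce(models)

print("group that synchronized:", sorted(group.tolist()))
print("identical replicas after global step:",
      bool(np.allclose(after_global, after_global[0])))
```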

10:20-10:50

高鸣宇, Assistant Professor, IIIS, Tsinghua University

Talk title:

Interstellar: Using Halide’s Scheduling Language to Analyze DNN Accelerators

Abstract:

We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide’s scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
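The "seven nested loops" are the conventional loops of a dense convolution: batch, output channels, input channels, the two output spatial dimensions, and the two kernel dimensions. The plain-Python reference below (the loop naming is the community convention, not code from the paper) makes the schedule space concrete; every dataflow in the paper's taxonomy is some reordering, blocking, and hardware parallelization of exactly this nest.

```python
import numpy as np

def conv2d_seven_loops(x, w):
    """Direct convolution written as the seven canonical nested loops.
    x: (N, C, H, W) input; w: (K, C, R, S) weights;
    returns y: (N, K, H-R+1, W-S+1)."""
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    y = np.zeros((N, K, P, Q), dtype=x.dtype)
    for n in range(N):                          # 1: batch
        for k in range(K):                      # 2: output channels
            for c in range(C):                  # 3: input channels
                for p in range(P):              # 4: output rows
                    for q in range(Q):          # 5: output cols
                        for r in range(R):      # 6: kernel rows
                            for s in range(S):  # 7: kernel cols
                                y[n, k, p, q] += x[n, c, p + r, q + s] * w[k, c, r, s]
    return y

# Tiny sanity check on shapes.
x = np.random.rand(1, 2, 5, 5).astype(np.float32)
w = np.random.rand(3, 2, 3, 3).astype(np.float32)
print(conv2d_seven_loops(x, w).shape)  # (1, 3, 3, 3)
```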

10:55-11:25

陈晓明 (Xiaoming Chen), Associate Professor, Institute of Computing Technology, Chinese Academy of Sciences

Talk title:

Communication Lower Bound in Convolution Accelerators

Abstract:

In current convolutional neural network (CNN) accelerators, communication (i.e., memory access) dominates the energy consumption. This work provides comprehensive analysis and methodologies to minimize the communication for CNN accelerators. For the off-chip communication, we derive the theoretical lower bound for any convolutional layer and propose a dataflow to reach the lower bound. This fundamental problem has never been solved by prior studies. The on-chip communication is minimized based on an elaborate workload and storage mapping scheme. We in addition design a communication-optimal CNN accelerator architecture. Evaluations based on the 65nm technology demonstrate that the proposed architecture nearly reaches the theoretical minimum communication in a three-level memory hierarchy and it is computation dominant. The gap between the energy efficiency of our accelerator and the theoretical best value is only 37-87%.
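For intuition about what such a lower bound looks like, recall the classic Hong-Kung red-blue pebble game result for matrix multiplication, which this line of work generalizes. The form below is the textbook matmul bound, stated here only for orientation; it is not the paper's convolution-specific expression.

```latex
% Textbook Hong-Kung bound (matrix multiplication, for intuition only):
% multiplying two n-by-n matrices with an on-chip buffer of S words
% requires off-chip traffic
Q \;=\; \Omega\!\left( \frac{n^{3}}{\sqrt{S}} \right).
% The talk's paper derives the analogous tight lower bound for arbitrary
% convolutional layers, together with a dataflow that attains it.
```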

11:30-12:00

侯锐, Professor, Institute of Information Engineering, Chinese Academy of Sciences

Talk title:

DNNGuard: An Elastic Heterogeneous Architecture for DNN Accelerator against Adversarial Attacks

Abstract:

Recent studies show that Deep Neural Networks (DNN) are vulnerable to adversarial samples that are generated by perturbing correctly classified inputs to cause the misclassification of DNN models. This can potentially lead to disastrous consequences, especially in security-sensitive applications such as unmanned vehicles, finance and healthcare. Existing adversarial defense methods require a variety of computing units to effectively detect the adversarial samples. However, deploying adversarial sample defense methods in existing DNN accelerators leads to many key issues in terms of cost, computational efficiency and information security. Moreover, existing DNN accelerators cannot provide effective support for the special computation required in the defense methods. To address these new challenges, we propose DNNGuard, an elastic heterogeneous DNN accelerator architecture that can efficiently orchestrate the simultaneous execution of original (target) DNN networks and the detection algorithm or network that detects adversarial sample attacks. The architecture tightly couples the DNN accelerator with the CPU core into one chip for efficient data transfer and information protection. An elastic DNN accelerator is designed to run the target network and detection network simultaneously. Besides the capability to execute two networks at the same time, DNNGuard also supports non-DNN computing and allows special layers of the neural network to be effectively supported by the CPU core. To reduce off-chip traffic and improve resource utilization, we propose a dynamic resource scheduling mechanism. To build a general implementation framework, we propose an extended AI instruction set for neural network synchronization, task scheduling and efficient data interaction.
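As a purely illustrative sketch of what dynamic resource scheduling between two co-resident networks could mean, the toy scheduler below splits processing elements (PEs) between the target and detection networks in proportion to their pending work, with a guaranteed minimum share so neither stalls. The classes, numbers, and the proportional-split policy are assumptions for illustration, not DNNGuard's hardware mechanism.

```python
from dataclasses import dataclass

@dataclass
class NetworkTask:
    name: str
    pending_ops: int  # remaining work, in MAC operations

def split_pes(total_pes, target, detector, min_share=0.2):
    """Split PEs between the target and detection networks in proportion
    to pending work, clamped so each keeps at least `min_share` of PEs."""
    total = target.pending_ops + detector.pending_ops
    if total == 0:
        return total_pes // 2, total_pes - total_pes // 2
    t = int(total_pes * target.pending_ops / total)
    lo = int(total_pes * min_share)
    hi = total_pes - lo
    t = max(lo, min(hi, t))
    return t, total_pes - t

target = NetworkTask("target_net", pending_ops=1_800_000)
detector = NetworkTask("detector_net", pending_ops=600_000)
t_pes, d_pes = split_pes(256, target, detector)
print(f"target PEs: {t_pes}, detector PEs: {d_pes}")  # target PEs: 192, detector PEs: 64
```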

12:00-12:05 Closing Remarks

梁云, Research Professor, Peking University; BAAI Young Scientist

How to register

Scan the QR code to add 小源 (the BAAI assistant) as a WeChat contact,

then send the keyword "live0412"

to join the registration WeChat group, where the slides and the livestream link will be shared.

Alternatively, tap "Read the original article" to go to the seminar's official page.

