Xian Z

15-418 Final Project Proposal

Bromide -- Fast CNN Inference In Halide

back to project index

Summary

We are building a framework for deep learning using the Halide image processing language. It is aimed for inference on deep neural networks, and is expected to achieve high performance on CPU.

Background

Lots of modern machine learning applications, such as AlphaGo, are using Convolutional Neural Networks (CNNs) for classification tasks. However, the performance of training and inferring with CNNs really matters. For huge data set and a large CNN configuration, training can take several days, making it urgent to speed up the training process. Also, in real-time applications, such as Apple Siri, the inference time is also expected to be short, so that the delay will be tolerable for app users.

Halide is a language and compiler for optimizing parallelism, locality, and re-computation in image processing pipelines. Since it provides many scheduling interfaces to handle matrices, it is a great source for CNN implementation, where the intermediate results are also matrices. Therefore, by carefully scheduling CNN training and inference with Halide, an obvious speedup is expected.

Challenge

Halide is a language targeted at image processing, which provides a set of computation scheduling methods to help programmers exploit data locality and achieve higher performance when writing parallel programs. However, since it is not designed for deep learning methods, it does not have intrinsic support for the typical computation patterns or algorithms in deep learning, for example, convolution.  Nowadays there are many a mature (sort of) implementation of deep learning methods.

For example, Caffe, is a frequently seen framework in the world of deep neural networks. One feature about Caffe is that it takes advantage of a fast matrix multiplication library like Intel's MKL when running convolutional neural networks by using matrix multiplications. However, breaking down the convolutional layers to matrix multiplications might not produce a desirable performance speedup since Halide does not natively support matrix multiplication. We think using FFT could avoid both the original complex convolution computations and the need for fast matrix multiplication. Although doing the DFT and its inverse will take some time but this might be better than the matrix multiplications way. So the challenge here is how to schedule and tune the program using Halide to achieve admissible performance on CPU or/and GPU, compared to the current matrix multiplication approach.  Another challenge would be to implement a set of complex modern deep neural networks using Halide. Halide, as a image processing language, is likely to have some constraints or limitation in the elevation of performance in the deep learning algorithms. It might be a challenge to use its scheduling methods to achieve desirable speedup.

Resources

Halide source code: https://github.com/halide/Halide.
Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F. and Amarasinghe, S., 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6), pp.519-530.
(Can we get to learn about the automatic scheduling Halide program from Ravi Mullapudi?)

Goals and Deliverables

Plan to achieve:

High performance inferring on CNN (native CPU and GPU). It includes convolution, pooling, normalization, activation, and fully-connected layers. We will try improving the performance of forwarding these layers with Halide.

High performance training on CNN (native CPU and GPU). It includes computing the derivatives of parameters in convolution, pooling, normalization, activation, and fully-connected layers. Again, we will use Halide to increase the performance.

To demonstrate, CNN inferring consumes most time on the convolution layers. So our focus will be on improving the performance of this layer. Three methods can be used:

Direct convolution with Halide scheduling.
Converting convolution to matrix multiplication, and scheduling with Halide.
Converting convolution to dot product by FFT, and scheduling with Halide.

Hope to achieve:

Implementing a set of modern deep convolutional networks based on the classic implementations of inferring and training on CNN Analyzing the advantages and disadvantages of Halide, both in a general way and by comparing to Caffe (and some other implementations), to see what can be done to improve it (for example the support for Xeon Phi).

Demo:

We would like to show a graphical interpretation of our scheduling methods for the networks we are implementing, as well as the corresponding results of them compared to naive (according to the original algorithms, without too much organization, but somewhat optimized) C++ code and Caffe (if applicable). The results would be composed of running time given input data and computation resources used.

Platform

GHC Machines: ghc[47-84].ghc.andrew.cmu.edu (Intel® Core™ i7-4785T Processor, 4 cores, 8 threads, SSE4.1/4.2, AVX 2.0)
ghc41.ghc.andrew.cmu.edu (CUDA cores: 2304)

Schedule

4/1 - 4/8:

Get familiar with Halide, by studying the toturials;
Implement the inference process of CNN with Halide: including convolution layers with scheduling and FFT-based approaches, and pooling, normalization, activation and fully connected layers.

4/9 - 4/15:

Implement training process of CNN with Halide;
Trying using the automatic scheduling tool by Ravi on our program.

4/16 - 4/22:

Select some modern deep neural networks and implement them with Halide.

4/23 - 4/29:

Compare performance with Caffe on CPU and GPU platforms, in terms of training and inference.