15-418 Final Project Checkpoint

Bromide -- Fast CNN Inference In Halide



Summary

Checked off so far:

A set of layers used in neural networks (fully-connected, convolution, activation, pooling, etc.) has been implemented in Halide, along with some helper 'layers'. The main operation each layer performs during inference is the forward pass, so we modularized the implementation by factoring some functionality (such as matrix flattening for the fully-connected layer) out into individual layers.
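
As a concrete illustration of this modularization, below is a minimal sketch (not our exact code) of a flatten helper feeding a fully-connected layer in Halide. The buffer names, shapes, and sizes are illustrative assumptions.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        // Hypothetical sizes and parameter buffers, for illustration only.
        const int W = 8, H = 8, C = 16;              // input activation shape
        ImageParam input(Float(32), 3, "input");     // (x, y, channel)
        ImageParam weights(Float(32), 2, "weights"); // (flattened index, unit)
        ImageParam bias(Float(32), 1, "bias");       // one bias per unit

        Var i("i"), u("u");

        // Helper 'layer': flatten the 3-D activation into a 1-D vector,
        // so the fully-connected layer only ever sees a flat input.
        Func flattened("flattened");
        flattened(i) = input(i % W, (i / W) % H, i / (W * H));

        // Fully-connected layer: bias plus a reduction over the flat input.
        Func fc("fc");
        RDom r(0, W * H * C);
        fc(u) = bias(u);
        fc(u) += weights(r, u) * flattened(r);

        fc.compile_jit();
        return 0;
    }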

An important objective of the first phase of the project was to learn and explore the capabilities of Halide and to adjust the predetermined plan and goals accordingly. Halide gives the user a succinct, explicit syntax for making scheduling decisions, i.e., for trading off memory efficiency, redundant work, and parallelism, so there is a lot of room for tuning to raise performance. For the same reason, however, since those decisions have to be made by hand, performing inference on a number of deep networks already costs a serious amount of work, so we have decided to keep the training part off the table for this project.
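
To make that trade-off concrete, the sketch below uses the classic two-stage blur pipeline (not one of our network layers): the algorithm stays fixed while the schedule chooses between computing the producer at root (no redundant work, more memory traffic) and computing it per tile of a parallelized, vectorized consumer (some recomputation at tile edges, better locality).

    #include "Halide.h"
    using namespace Halide;

    int main() {
        ImageParam in(Float(32), 2, "in");
        Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

        Func blur_x("blur_x"), blur_y("blur_y");
        blur_x(x, y) = (in(x, y) + in(x + 1, y) + in(x + 2, y)) / 3.0f;
        blur_y(x, y) = (blur_x(x, y) + blur_x(x, y + 1) + blur_x(x, y + 2)) / 3.0f;

        // Option A: compute the producer entirely before the consumer.
        // No redundant work, but a full temporary buffer and poor locality.
        // blur_x.compute_root();

        // Option B: tile the consumer, compute the producer per tile, and
        // exploit parallelism and vectorization. A little work near tile
        // edges is recomputed, but locality and utilization improve.
        blur_y.tile(x, y, xo, yo, xi, yi, 256, 32)
              .parallel(yo)
              .vectorize(xi, 8);
        blur_x.compute_at(blur_y, xo)
              .vectorize(x, 8);

        blur_y.compile_jit();
        return 0;
    }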

General Adjustment:

Given the limited time remaining and our busy schedules, a full deep-learning framework is unlikely to be feasible before the end of the semester. The objective now is to implement deep networks efficiently in Halide by making use of Halide's scheduling capabilities. We are considering training the networks with Caffe and using its output (the trained parameters) to build the deep networks in which we perform inference.

At first we wanted to address convolution using the FFT; however, after some investigation and tests, solving the convolution in the frequency domain instead of as a matrix multiplication does not look like a good idea, given that most kernels are very small. Over the next few days we plan to spend some time tuning the matrix-multiplication part of the project based on the networks we are using.
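
The sketch below shows the direction that tuning will take: the convolution lowered to a matrix multiplication (im2col style), with the reduction tiled, parallelized, and vectorized in Halide. The matrix names and dimensions are assumptions for illustration, not the real network's.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        // A is the (patch index, output pixel) matrix produced by im2col,
        // B is the (output channel, patch index) weight matrix.
        const int K = 3 * 3 * 64;              // length of one kernel patch
        ImageParam A(Float(32), 2, "A");       // (k, pixel)
        ImageParam B(Float(32), 2, "B");       // (out_channel, k)

        Var i("i"), j("j"), io("io"), jo("jo"), ii("ii"), ji("ji");
        Func C("C");
        RDom r(0, K);
        C(i, j) = 0.0f;
        C(i, j) += A(r, j) * B(i, r);

        // Tuning sketch: tile the output, parallelize across tile rows,
        // and vectorize the innermost pure dimension of the update.
        C.tile(i, j, io, jo, ii, ji, 32, 32)
         .parallel(jo);
        C.update()
         .tile(i, j, io, jo, ii, ji, 32, 32)
         .parallel(jo)
         .vectorize(ii, 8);

        C.compile_jit();
        return 0;
    }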

Goals and Deliverables (Revised)

Plan to achieve:

High-performance inference on CNNs (native CPU and GPU). This includes convolution, pooling, normalization, activation, and fully-connected layers. We will try to improve the performance of forwarding these layers with Halide.
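
For example, a pooling layer is a short reduction in Halide; a minimal sketch of a 2x2 max-pooling layer with a simple CPU schedule follows (the buffer layout and names are assumptions).

    #include "Halide.h"
    using namespace Halide;

    int main() {
        ImageParam in(Float(32), 3, "in");   // (x, y, channel)
        Var x("x"), y("y"), c("c");

        // 2x2 max pooling with stride 2 over each channel.
        Func pool("pool");
        RDom r(0, 2, 0, 2);
        pool(x, y, c) = maximum(in(2 * x + r.x, 2 * y + r.y, c));

        // CPU schedule: parallelize across channels, vectorize along x.
        pool.parallel(c).vectorize(x, 8);
        // A GPU schedule would instead map x and y to thread blocks
        // (e.g. with gpu_tile) given a CUDA or OpenCL target.

        pool.compile_jit();
        return 0;
    }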

High-performance training on CNNs (native CPU and GPU). This includes computing the derivatives of the parameters in convolution, pooling, normalization, activation, and fully-connected layers. Again, we will use Halide to increase the performance.

CNN inference spends most of its time in the convolution layers, so our focus will be on improving the performance of this layer. Three methods can be used:

  • Direct convolution with (some) Halide scheduling (see the sketch after this list).
  • Converting the convolution to a matrix multiplication, and scheduling it with Halide.
  • Converting the convolution to an element-wise multiplication in the frequency domain via the FFT, and scheduling it with Halide. (This might still be implemented, but only for comparison.)
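
A minimal sketch of the first method, direct convolution with a basic Halide schedule, is shown below; the kernel size, channel count, and parameter names are illustrative assumptions.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        const int KX = 3, KY = 3, CIN = 64;        // assumed kernel/channels
        ImageParam in(Float(32), 3, "in");         // (x, y, in_channel)
        ImageParam kernel(Float(32), 4, "kernel"); // (kx, ky, in_ch, out_ch)

        Var x("x"), y("y"), co("co"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
        Func conv("conv");
        RDom r(0, KX, 0, KY, 0, CIN);
        conv(x, y, co) = 0.0f;
        conv(x, y, co) += in(x + r.x, y + r.y, r.z) * kernel(r.x, r.y, r.z, co);

        // Schedule: tile the output, parallelize over output channels,
        // and vectorize the innermost output dimension of the update.
        conv.update()
            .tile(x, y, xo, yo, xi, yi, 32, 8)
            .parallel(co)
            .vectorize(xi, 8);

        conv.compile_jit();
        return 0;
    }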

Implementing a set of modern deep convolutional networks based on the classic approaches to CNN inference and training.
Analyzing the advantages and disadvantages of Halide, both in general and in comparison with Caffe (and some other implementations), to see what could be done to improve it (for example, support for Xeon Phi).

Hope to achieve:

Achieve better performance than Caffe.

Demo:

We would like to show a graphical interpretation of our scheduling methods for the networks we are implementing, together with the corresponding results compared against naive C++ code (a direct translation of the original algorithms, without much reorganization, but somewhat optimized) and against Caffe (where applicable). The results will consist of running times for given input data and the computation resources used.

Schedule (Revised)

4/20 - 4/22:
  • Decide which networks to use and obtain the training results (Ruizhou)
  • Learn the network configuration file format used by Caffe (Xian)
4/23 - 4/24:
  • Implement the network(s) using the training results from Caffe (Ruizhou)
  • Implement basic scheduling on convolution (Xian)
4/25 - 4/28:
  • Implement the network(s) using the training results from Caffe (Ruizhou)
  • Implement scheduling on convolution and tune it (Xian)
4/29 - 5/1:
  • Explore parallelism details in Halide relevant to the project (Ruizhou)
  • Implement scheduling on the other layers and tune them (Xian)
  • Run some tests on the current networks (Ruizhou, Xian)
5/2 - 5/5:
  • Continue the scheduling and tuning work (Ruizhou, Xian)
  • Run tests on the current networks and make comparisons (compare performance with Caffe on CPU and GPU platforms) (Ruizhou, Xian)
5/5 - 5/9:
  • Do further tests and make corresponding adjustments (Ruizhou, Xian)
  • Prepare the demo (Ruizhou, Xian)