Evaluation Data & Platform
Any online video, or any two consecutive images, can be used as test data for our model. Please see our GitHub README for details on preparing test data from your own video.
We evaluated our implementations on sample videos downloaded online (Bunny and NYC). The video frames (360p, 720p, 1080p) can be extracted with FFmpeg, and the pairwise matches (360p, 720p, 1080p) can be generated using our helper script (HPC-ready). We generate both forward and backward flows for each video, since both applications we presented use them; therefore, a video with N frames requires (N-1)*2 flows. We use Odyssey to generate the matches.
To evaluate DeepFlow on two images, we generate the forward flow between the first two frames of the above videos (giving us 360p, 720p, and 1080p image pairs). The command is:
time ./deepflow2 out/frame_000001.ppm out/frame_000002.ppm flow.flo -match match/forward_1_2.match
To evaluate DeepFlow on videos (except with MapReduce), we process the frame pairs serially using a bash script, which prints the accumulated time to the terminal. Since, in theory, the time cost grows linearly with video length, we only report the average processing time per second of video. All our videos are 25 fps, so one second of video corresponds to 50 flows (a standalone 1-second clip actually has only 48 flows, i.e. 24 pairs, but for a long enough video it averages out to 50 flows per second). For the 360p video, we run on 101 frames (the accumulated time divided by 4 is the per-second average); for the 720p and 1080p videos, we run on 26 frames (the accumulated time is already the per-second average). The command is:
bash video_flow.sh
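To make the bookkeeping concrete, here is the arithmetic behind the run lengths above (using only the frame counts and the 50 flows per second figure already stated):

```latex
(101 - 1) \times 2 = 200 \text{ flows} = 4\,\text{s} \times 50\,\tfrac{\text{flows}}{\text{s}}
  \quad\Rightarrow\quad \text{divide the accumulated 360p time by } 4
\\
(26 - 1) \times 2 = 50 \text{ flows} = 1\,\text{s} \times 50\,\tfrac{\text{flows}}{\text{s}}
  \quad\Rightarrow\quad \text{the accumulated 720p/1080p time is already per-second}
```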
To make the comparison fair, we use AWS m4.2xlarge instances for all experiments except OpenACC (where we use g3.4xlarge). We don't use Odyssey because it lacks some necessary libraries and we don't have the privileges to install them. Except for MapReduce (where we use emr-5.13.0), the operating system is Ubuntu Server 16.04 LTS (HVM) (ami-4e79ed36). The original implementation and all of our implementations use the -O3 compile flag and default DeepFlow parameters.
A more detailed guide for reproducing our test results is provided in our GitHub README.
Serial
The table below shows the processing time for a pair of images. "SOR Serial Optimized" is the original implementation; the optimization refers to a serial optimization that reduces the number of calculations.
| | 360p Images | 720p Images | 1080p Images |
|---|---|---|---|
| SOR Serial | 6.494 seconds | 26.997 seconds | 59.326 seconds |
| SOR Serial Optimized | 4.598 seconds | 19.382 seconds | 42.582 seconds |
| Jacobi Serial | 3.870 seconds | 16.574 seconds | 35.765 seconds |
| Red-Black SOR Serial | 3.687 seconds | 15.812 seconds | 34.476 seconds |
Going from SOR to Jacobi or Red-Black SOR, we removed the dependencies inside the big iteration (so the updates within each big iteration become independent of one another). With the -O3 flag, the compiler can likely optimize such code more aggressively, so it is not surprising that our Jacobi and Red-Black SOR are faster than even the optimized SOR. However, as discussed on the design page, Jacobi's output differs noticeably from the original implementation's, which suggests Jacobi is not a good choice. Given that Red-Black SOR is also faster than Jacobi, there is no reason to use Jacobi. Before we came up with Red-Black SOR, we had implemented OpenMP and OpenACC versions of Jacobi, but we didn't evaluate them for the reason stated above.
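To illustrate why the red-black ordering removes the dependency, here is a minimal sketch of one red-black SOR sweep on a simple 5-point grid. It is illustrative only: DeepFlow's actual linear systems and coefficients are different, and the function and variable names here are made up.

```c
/* One red-black SOR sweep on an n x m grid (illustrative 5-point stencil).
 * Within one color, no point reads a value written by another point of the
 * same color, so each half-sweep can be parallelized, e.g. with OpenMP.
 * Plain SOR cannot, because u[i][j] depends on the freshly updated
 * u[i-1][j] and u[i][j-1]. */
void redblack_sor_sweep(int n, int m, float **u, float **rhs, float omega)
{
    for (int color = 0; color < 2; color++) {          /* 0 = red, 1 = black */
        #pragma omp parallel for                        /* same-color points are independent */
        for (int i = 1; i < n - 1; i++) {
            /* start column chosen so that (i + j) % 2 == color */
            for (int j = 2 - ((i + color) & 1); j < m - 1; j += 2) {
                float gs = 0.25f * (u[i-1][j] + u[i+1][j] +
                                    u[i][j-1] + u[i][j+1] + rhs[i][j]);
                u[i][j] = (1.0f - omega) * u[i][j] + omega * gs;  /* SOR relaxation */
            }
        }
    }
}
```

Without the pragma (or without -fopenmp) the same loop is just the serial Red-Black SOR; with it, this is the kind of loop the OpenMP results below build on.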
We also show the video results below:
| | 360p Video | 720p Video | 1080p Video |
|---|---|---|---|
| SOR Serial | 323 seconds | 1316 seconds | 2950 seconds |
| SOR Serial Optimized | 226 seconds | 926 seconds | 2129 seconds |
| Jacobi Serial | 192 seconds | 788 seconds | 1791 seconds |
| Red-Black SOR Serial | 183 seconds | 752 seconds | 1725 seconds |
Just by replacing the original SOR solver with Red-Black SOR, we achieve roughly a 1.235× speedup over the optimized original implementation. All parallel implementations below are based on the serial Red-Black SOR version, and we use the serial Red-Black SOR result as the baseline.
OpenMP
The following two tables show the OpenMP results.
| | 360p Images | 720p Images | 1080p Images |
|---|---|---|---|
| 1 thread (serial) | 3.687 seconds | 15.812 seconds | 34.476 seconds |
| 2 threads | 2.518 seconds | 11.588 seconds | 25.253 seconds |
| 4 threads | 1.910 seconds | 9.160 seconds | 19.730 seconds |
| 8 threads | 2.048 seconds | 9.171 seconds | 19.883 seconds |
| 8 threads on m4.4xlarge | 1.698 seconds | 7.959 seconds | 17.174 seconds |
To our surprise, the 8-thread result is worse than the 4-thread result. We investigated and found that an m4.2xlarge node only has 4 physical cores (2 hardware threads per core due to hyper-threading), so when 2 threads share a core, both slow down because of resource sharing. To show the general trend, we also tested on an m4.4xlarge node, which has 8 physical cores (16 hardware threads). There the result is as expected (slightly faster than 4 threads).
| | 360p Video | 720p Video | 1080p Video |
|---|---|---|---|
| 1 thread (serial) | 183 seconds | 752 seconds | 1725 seconds |
| 4 threads | 102 seconds | 414 seconds | 1004 seconds |
(Since testing on videos is very time-consuming, we only test with OMP_NUM_THREADS=4, as this performs best on an m4.2xlarge node for two images.)
The following plots show the speedup of DeepFlow on two images. The left plot is for the m4.2xlarge node; the right plot replaces the 8-thread result with the one from m4.4xlarge (removing the hyper-threading factor), so it should better represent the general trend. The red line is the maximum theoretical speedup assuming no overhead. Our curves follow a similar trend but with lower speedup, due to the overhead of thread creation, termination, and synchronization.
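For reference, the speedup we plot is the usual ratio (a definition, not a new measurement):

```latex
S(p) = \frac{T_1}{T_p}, \qquad S_{\text{ideal}}(p) = p \quad \text{(the red line)}
```

where T_p is the two-image processing time with p threads.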


The following plot shows the improvement from OpenMP on videos:

MPI (on 2 nodes)
| | 360p Images | 720p Images | 1080p Images |
|---|---|---|---|
| MPI with 3 processes | 3.245 seconds | 12.711 seconds | 32.605 seconds |
| MPI with 4 processes | 2.849 seconds | 11.188 seconds | 28.665 seconds |
| MPI with 5 processes | 2.567 seconds | 10.216 seconds | 25.040 seconds |
| MPI with 6 processes | 2.533 seconds | 10.070 seconds | 24.113 seconds |
| MPI with 7 processes | 3.402 seconds | 10.672 seconds | 22.835 seconds |
| MPI with 8 processes | 3.339 seconds | 14.570 seconds | 22.869 seconds |
| MPI with 16 processes | 4.457 seconds | 14.616 seconds | 32.863 seconds |
We implemented MPI and ran our model on 2 AWS nodes (16 vCPUs in total, since each m4.2xlarge provides 8 vCPUs). But since each m4.2xlarge only has 4 physical cores, MPI with 16 processes performs worse than with 8 processes, which agrees with the earlier OpenMP results. The speedup plot is shown below.

Although performance improves with the number of processes (up to a point), MPI still performs poorly. This is no surprise: the messages passed between nodes are large and frequent because Red-Black SOR is called so often, so the overhead is substantial.
Note that we are not performing Red-Black SOR on a single matrix. Instead, we deal with many matrices of different sizes, so every time we call Red-Black SOR we need to redistribute the work across the nodes, and that is how the overhead accumulates.
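The sketch below shows, schematically, where that per-call cost comes from. It is illustrative C only, not our actual implementation; all names are made up and the halo exchange is only indicated in comments.

```c
#include <mpi.h>
#include <stdlib.h>

#define MAX_ITER 25   /* illustrative inner-iteration count */

/* Every call redistributes the current linear system and collects the result,
 * so the matrices cross the network once per call. DeepFlow calls the solver
 * many times on systems of different sizes, which is where the MPI overhead
 * accumulates. */
void mpi_redblack_solve(float *u, int *counts, int *displs, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* this rank's block of rows -- scattered on EVERY solver call */
    float *local = malloc(counts[rank] * sizeof(float));
    MPI_Scatterv(u, counts, displs, MPI_FLOAT,
                 local, counts[rank], MPI_FLOAT, 0, comm);

    for (int iter = 0; iter < MAX_ITER; iter++) {
        /* red and black half-sweeps on the owned rows ...                */
        /* ... plus boundary-row exchanges with neighbouring ranks        */
        /* (e.g. MPI_Sendrecv) after each half-sweep, once per iteration. */
    }

    /* gather the updated rows back on the root -- again on every call */
    MPI_Gatherv(local, counts[rank], MPI_FLOAT,
                u, counts, displs, MPI_FLOAT, 0, comm);
    free(local);
}
```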
We didn't evaluate MPI on videos due to its poor performance on two images.
MPI+OpenMP (on 3 nodes)
| | 360p Images | 720p Images | 1080p Images |
|---|---|---|---|
| MPI with 3 processes + OpenMP with 4 threads | 2.857 seconds | 13.609 seconds | 31.873 seconds |
We also implemented a hybrid MPI+OpenMP version and ran our model on 3 AWS nodes (24 vCPUs in total, since each m4.2xlarge provides 8 vCPUs), treating each node as an SMP node. Since we know 4 threads is best for OpenMP on m4.2xlarge, we use 4 OpenMP threads and 3 MPI processes; that is, we assign work to 3 nodes and run 4-thread OpenMP on each of them. The results are still bad, especially at high resolutions where message passing takes more time.
We didn't evaluate MPI+OpenMP on videos due to its poor performance on two images.
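For clarity, this is the structure of the hybrid setup, one MPI rank per node with a 4-thread OpenMP region inside. It is a bare skeleton under our stated configuration, not our actual code.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI_THREAD_FUNNELED: only the main thread of each rank makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each of the 3 ranks (one per node) gets a block of the work ... */
    #pragma omp parallel num_threads(4)
    {
        printf("rank %d: thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
        /* ... and runs the 4-thread red-black half-sweeps on its block here */
    }

    MPI_Finalize();
    return 0;
}
```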
OpenACC
Since PGCC does not support the v4sf vector type, we had to rewrite the affected code. However, this makes the serial part of our code slower.
| | 360p Images | 720p Images | 1080p Images |
|---|---|---|---|
| OpenACC | 3.737 seconds | 16.499 seconds | 28.385 seconds |
The CPU serial execution time is similar on the g3.4xlarge and m4.2xlarge nodes. Compared to the serial version, OpenACC is slower on 360p and 720p images and slightly faster on 1080p images. We believe the reasons are (1) the v4sf workaround mentioned above hurts performance, and (2) our matrices are not big enough, so the host-device memory copy overhead plays an important role. This also explains why OpenACC does better on 1080p images.
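To show where the memory-copy overhead enters, here is a rough OpenACC version of the same red-black sweep with plain float arrays instead of v4sf. This is an illustrative sketch, not our actual kernel; the data clauses mark the host-to-device transfers that happen on every call.

```c
/* Red-black sweep offloaded with OpenACC (illustrative). The data region
 * copies u and rhs to the GPU and back on every call; for the relatively
 * small systems DeepFlow solves repeatedly, this transfer cost can outweigh
 * the gain from the GPU, which matches what we observe at 360p/720p. */
void redblack_sweep_acc(int n, int m, float *u, const float *rhs, float omega)
{
    #pragma acc data copy(u[0:n*m]) copyin(rhs[0:n*m])   /* host<->device transfers */
    {
        for (int color = 0; color < 2; color++) {
            #pragma acc parallel loop
            for (int i = 1; i < n - 1; i++) {
                for (int j = 2 - ((i + color) & 1); j < m - 1; j += 2) {
                    float gs = 0.25f * (u[(i-1)*m + j] + u[(i+1)*m + j] +
                                        u[i*m + (j-1)] + u[i*m + (j+1)] +
                                        rhs[i*m + j]);
                    u[i*m + j] = (1.0f - omega) * u[i*m + j] + omega * gs;
                }
            }
        }
    }
}
```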
Based on the two-image results, we conclude that OpenACC is not a good choice for parallelizing DeepFlow, so we didn't evaluate it on video tasks.
MapReduce + OpenMP
MapReduce can only be applied to video tasks.
Note: We asked Amazon to raise our m4.2xlarge instance limit to 9, but they declined. The 8-node result therefore uses 4 m4.2xlarge instances and 4 r4.2xlarge instances; r4.2xlarge is the closest instance type to m4.2xlarge we could find, and it is slightly more powerful.
From the OpenMP results, we know 4 threads is best, so we set the OMP_NUM_THREADS=4 environment variable when running the OpenMP implementation. As a further trick to speed up execution, we assign a pair of images to each worker node and compute the forward and backward flows simultaneously. Using the OpenMP implementation as the baseline, the ideal speedup would therefore be 2 × #nodes.
| | 360p Video | 720p Video | 1080p Video |
|---|---|---|---|
| Serial | 183 seconds | 752 seconds | 1725 seconds |
| OpenMP | 102 seconds | 414 seconds | 1004 seconds |
| 2 nodes | 159 seconds | 280 seconds | 556 seconds |
| 4 nodes | 83 seconds | 193 seconds | 307 seconds |
| 8 nodes | 48 seconds | 112 seconds | 147 seconds |
The following plot shows the speedup using the serial version as the baseline.

As shown in the figure, the speedup increases nearly linearly. Due to the issue mentioned above (r4.2xlarge is slightly more powerful than m4.2xlarge), the speedup grows somewhat faster from 4 nodes to 8 nodes; without that issue, we would expect a linear increase.
MapReduce has overheads such as HDFS file copies, and those overheads are approximately constant. In theory, the larger the fraction of time spent on flow calculation, the higher the speedup. This is why, in the plot above, the 360p video has the worst speedup while the 1080p video has the best.
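A rough back-of-the-envelope model of this effect (illustrative only; T_flow is the one-node OpenMP flow-computation time and T_const the roughly constant MapReduce/HDFS overhead):

```latex
S(k) \;\approx\; \frac{T_{\text{flow}}}{\,T_{\text{flow}}/(2k) + T_{\text{const}}\,}
\;\longrightarrow\; 2k \quad \text{as } T_{\text{flow}} \gg T_{\text{const}}
```

Here k is the number of worker nodes. Higher resolutions have a larger T_flow relative to T_const, so their speedup sits closer to the ideal 2k.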
Conclusion
- For 2 consecutive images, we should use OpenMP parallelization.
- For videos, we should use MapReduce+OpenMP parallelization. (Strong scaling, linear speedup)
The other methods (OpenACC, MPI) do not work well in our project because we solve the same type of problem over and over while each individual problem is not large (we call the linear solver many times with different linear systems, and each individual solve is quick). This makes us very sensitive to the overhead introduced per call. (We can't solve all the linear systems together because of dependencies between them.) Although MPI and OpenACC are powerful, they also introduce significant overhead for each problem; if the serial cost of a single problem is not high enough, we don't see a significant speedup.