Abstract
Research on artistic style transfer, which trains computers to paint like artists, has recently become popular. Gatys et al. cast this task as an optimization problem and solved it using a convolutional neural network. However, this approach to image stylization does not work well for videos because it ignores temporal consistency. To address this, Ruder et al. proposed a method that adds a temporal loss to the loss function, but it is quite slow: stylizing a 15-second video takes more than 7.5 hours. Earlier this year, Johnson et al. made image stylization real-time by training a feed-forward network for the optimization problem instead of optimizing each image separately. By combining the ideas of Ruder et al. and Johnson et al., we developed a new method for video stylization that preserves temporal consistency while running about 10 times as fast as Ruder et al.'s method. Our method makes it practical to stylize movies and animations at reasonable time cost.
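To illustrate the temporal term from Ruder et al.: it penalizes the difference between the current stylized frame and the previous stylized frame warped forward by optical flow, with unreliable (e.g. occluded) pixels masked out. A minimal NumPy sketch, where nearest-neighbor warping and a precomputed mask are simplifications of the actual pipeline:

```python
import numpy as np

def warp_nearest(frame, flow):
    """Warp `frame` (H, W, C) by optical `flow` (H, W, 2) using
    nearest-neighbor sampling; flow[..., 0] is the horizontal
    displacement and flow[..., 1] the vertical displacement."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

def temporal_loss(curr, prev, flow, mask):
    """Mean squared difference between the current stylized frame and
    the previous stylized frame warped into it. `mask` (H, W) is 1
    where the flow is reliable and 0 at occlusions/disocclusions."""
    warped = warp_nearest(prev, flow)
    diff = (curr - warped) ** 2
    return float(np.sum(mask[..., None] * diff) / curr.size)
```

With zero flow and identical consecutive frames the loss is zero, which is the behavior the temporal term rewards: static regions should keep exactly the same stylization from frame to frame.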
Samples
We’ve uploaded some sample videos to YouTube.
Sample Story (15s)
Style: The Starry Night
Big Buck Bunny (8m 2s)
With the help of the Waifu2x super-resolution tool, we can produce 1080p and 4K HD stylized videos without incurring much extra computational cost for stylization.
1080p
Style: The Starry Night
4K
Doraemon (22m 39s)
Styles (from left to right): Composition VII, The Great Wave off Kanagawa, The Starry Night
Comparison with Other Methods
The following videos compare our method with other methods.
Sample Story (15s)
Resolution: 640×480

| Label (as in video) | Method | Time |
|---|---|---|
| Ruder et al. (30 iterations) | Ruder's method, 30 iterations | 0.80 hours |
| Johnson et al. (real-time) | Simply run Johnson's method on each frame | 77.8 seconds |
| Ruder et al. (1000 iterations) | Ruder's method, 1000 iterations | 7.56 hours |
| Our method (30 iterations), no pixel loss | Our method, pixel-loss weight 0 | 0.80 hours |
| Our method (30 iterations), with pixel loss | Our method, pixel-loss weight 1.5e-3 | 0.80 hours |
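For reference, the pixel-loss weight compared above enters the objective as one term of a weighted sum over loss components. A hypothetical sketch, where all weights except the pixel-loss weight 1.5e-3 are illustrative placeholders rather than our tuned values:

```python
def total_loss(content_l, style_l, temporal_l, pixel_l,
               w_content=1.0, w_style=5.0, w_temporal=1e2, w_pixel=1.5e-3):
    """Weighted sum of the per-frame loss terms. Setting w_pixel=0
    reproduces the 'no pixel loss' rows in the tables; all weights
    other than w_pixel are illustrative placeholders."""
    return (w_content * content_l + w_style * style_l
            + w_temporal * temporal_l + w_pixel * pixel_l)
```

The pixel loss directly ties consecutive stylized frames together at the pixel level, complementing the flow-based temporal term.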
Big Buck Bunny (8m 2s)
Resolution: 960×540

| Label (as in video) | Method | Time |
|---|---|---|
| Naive | Simply run Johnson's method on each frame | / |
| Without pixel loss | Our method, pixel-loss weight 0 | 33 hours |
| With pixel loss | Our method, pixel-loss weight 1.5e-3 | 33 hours |
Doraemon (22m 39s)
Resolution: 640×480

| Label (as in video) | Method | Time |
|---|---|---|
| With temporal consistency | Our method | 58 hours |
| No temporal consistency | Simply run Johnson's method on each frame | 1.89 hours |
| / | Ruder's method, 1000 iterations | ~23 days (estimated) |
Notes
The time costs above do not include the time for computing optical flow. Optical-flow computation can run in parallel with video stylization, and is itself parallelizable across frame pairs. In addition, the optical flow does not need to be recomputed when we only change the style for a video. We compute optical flow on a CPU cluster; for the Doraemon video, this took about 50 hours.
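Because the flow for each consecutive frame pair is independent of every other pair, the computation parallelizes trivially across a worker pool. A sketch using Python's standard library, where `compute_flow` is a hypothetical stand-in for the actual flow solver (a thread pool is used here for simplicity; on a cluster each pair would be a separate job):

```python
from multiprocessing.dummy import Pool  # thread-based pool, same API as multiprocessing.Pool

def compute_flow(pair):
    """Hypothetical stand-in for an optical-flow solver invoked on a
    pair of consecutive frame indices; a real version would load the
    two frames and return a (H, W, 2) flow field."""
    a, b = pair
    return (a, b)

def flows_for_video(num_frames, workers=4):
    """Compute the flow for every consecutive frame pair in parallel."""
    pairs = [(i, i + 1) for i in range(num_frames - 1)]
    with Pool(workers) as pool:
        return pool.map(compute_flow, pairs)
```

`pool.map` preserves input order, so the returned list lines up with the frame-pair sequence even though the pairs finish out of order.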
References
L.A. Gatys, A.S. Ecker, M. Bethge, A Neural Algorithm of Artistic Style. arXiv:1508.06576
M. Ruder, A. Dosovitskiy, T. Brox, Artistic style transfer for videos. arXiv:1604.08610
J. Johnson, A. Alahi, L. Fei-Fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv:1603.08155