Zhi-Wei Xu, Xiao-Jun Wu, Josef Kittler. STRNet: Triple-stream Spatiotemporal Relation Network for Action Recognition. International Journal of Automation and Computing, vol. 18, no. 5, pp. 718−730, 2021. https://doi.org/10.1007/s11633-021-1289-9

STRNet: Triple-stream Spatiotemporal Relation Network for Action Recognition

doi: 10.1007/s11633-021-1289-9
More Information
  • Author Bio:

    Zhi-Wei Xu received the B. Eng. degree in computer science and technology from Harbin Institute of Technology, China in 2017. He is a postgraduate student at the School of Artificial Intelligence and Computer Science, Jiangnan University, China. His research interests include computer vision, video understanding and action recognition. E-mail: zhiwei_xu@stu.jiangnan.edu.cn ORCID iD: 0000-0003-1472-431X

    Xiao-Jun Wu received the B. Sc. degree in mathematics from Nanjing Normal University, China in 1991. He received the M. Sc. and the Ph. D. degrees in pattern recognition and intelligent systems from Nanjing University of Science and Technology, China in 1996 and 2002, respectively. He is currently a professor in artificial intelligence and pattern recognition at Jiangnan University, China. His research interests include pattern recognition, computer vision, fuzzy systems, neural networks and intelligent systems. E-mail: wu_xiaojun@jiangnan.edu.cn (Corresponding author) ORCID iD: 0000-0002-0310-5778

    Josef Kittler received the B. A. degree in electrical science tripos, Ph. D. degree in pattern recognition, and D. Sc. degree from University of Cambridge, UK in 1971, 1974, and 1991, respectively. He is a Distinguished Professor of machine intelligence at the Centre for Vision, Speech and Signal Processing, University of Surrey, UK. He conducts research in biometrics, video and image database retrieval, medical image analysis, and cognitive vision. He published the textbook Pattern Recognition: A Statistical Approach and over 700 scientific papers. His publications have been cited more than 66000 times (Google Scholar). He is series editor of Springer Lecture Notes in Computer Science. He currently serves on the Editorial Boards of Pattern Recognition Letters, Pattern Recognition and Artificial Intelligence, and Pattern Analysis and Applications. He also served as a member of the Editorial Board of IEEE Transactions on Pattern Analysis and Machine Intelligence during 1982−1985. He served on the Governing Board of the International Association for Pattern Recognition (IAPR) as one of the two British representatives during the period 1982−2005, and as President of the IAPR during 1994−1996. E-mail: j.kittler@surrey.ac.uk ORCID iD: 0000-0002-8110-9205

  • Received Date: 2020-10-30
  • Accepted Date: 2021-02-05
  • Available Online: 2021-09-08
  • Publish Date: 2021-10-01
  • Learning comprehensive spatiotemporal features is crucial for human action recognition. Existing methods tend to model spatiotemporal feature blocks in an integrate-separate-integrate form, such as the appearance-and-relation network (ARTNet) and the spatiotemporal and motion network (STM). However, as such blocks are stacked up, the rear part of the network becomes poorly interpretable. To avoid this problem, we propose a novel architecture called the spatiotemporal relation network (STRNet), which can learn explicit appearance, motion and, especially, temporal relation information. Specifically, our STRNet is constructed from three branches, which separate the features into 1) an appearance pathway, to obtain spatial semantics, 2) a motion pathway, to reinforce the spatiotemporal feature representation, and 3) a relation pathway, to capture temporal relation details of successive frames and to explore long-term representation dependencies. In addition, our STRNet does not simply merge the multi-branch information; rather, we apply a flexible and effective strategy to fuse the complementary information from the multiple pathways. We evaluate our network on four major action recognition benchmarks: Kinetics-400, UCF-101, HMDB-51, and Something-Something v1. The results demonstrate that STRNet achieves state-of-the-art results on UCF-101 and HMDB-51, and accuracy comparable to the state of the art on Something-Something v1 and Kinetics-400.
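
    The triple-stream design described in the abstract can be pictured with a minimal PyTorch-style sketch. The module names, channel sizes, and the simple concatenation fusion below are illustrative assumptions for exposition only; the paper's actual backbones, relation modeling, and fusion strategy are more elaborate.

# Minimal sketch of a triple-stream (appearance / motion / relation) network.
# All names and sizes are hypothetical; this is not the authors' implementation.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Placeholder spatiotemporal feature extractor (one 3D-conv stage)."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),      # global spatiotemporal pooling
        )

    def forward(self, x):                 # x: (N, C, T, H, W)
        return self.net(x).flatten(1)     # (N, out_ch)

class TripleStreamNet(nn.Module):
    """Appearance, motion, and relation pathways fused before classification."""
    def __init__(self, num_classes: int = 400, feat: int = 64):
        super().__init__()
        self.appearance = Branch(out_ch=feat)
        self.motion = Branch(out_ch=feat)
        self.relation = Branch(out_ch=feat)
        self.fc = nn.Linear(3 * feat, num_classes)

    def forward(self, clip):
        # Each pathway sees the same clip here; in practice the motion and
        # relation pathways would operate on derived inputs (e.g., frame
        # differences or features of successive frame pairs), and the fusion
        # would be more flexible than plain concatenation.
        feats = torch.cat(
            [self.appearance(clip), self.motion(clip), self.relation(clip)], dim=1
        )
        return self.fc(feats)

if __name__ == "__main__":
    video = torch.randn(2, 3, 8, 112, 112)   # (batch, channels, frames, H, W)
    logits = TripleStreamNet()(video)
    print(logits.shape)                        # torch.Size([2, 400])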

     
