Remote Heart Rate Estimation Using Attention-targeted Self-Supervised Learning Methods

Jaechoon Jo, Yeo-chan Yoon


Heart rate measurement is a crucial factor in assessing an individual's overall health status. Abnormal heart rates, whether lower or higher than baseline, can indicate potential pathological or physiological abnormalities. Reliable heart rate monitoring technology is therefore needed in various fields, including medicine, biotechnology, and healthcare. With recent advances in deep learning research, it is now possible to monitor heart rate conveniently and hygienically without specialized equipment, using facial video-based photoplethysmography (rPPG). This technology employs deep learning-based video analysis, which requires a large dataset to achieve high performance. However, collecting and labeling a vast amount of data is often impractical and costly, so researchers have been searching for ways to achieve high performance with smaller datasets. This paper proposes a novel self-supervised learning approach tailored to facial video processing. Our method effectively acquires a deep latent representation from a facial image sequence and applies it to a target task through transfer learning. Using this method, we aim to improve remote heart rate estimation performance on limited-size datasets. The proposed method is specialized for facial image sequences and focuses on facial color changes, enabling high performance in existing attention-based deep learning models. The proposed self-supervised learning method has several advantages. First, it can learn useful features from unlabeled data, reducing reliance on annotated datasets. Second, it can help overcome the problem of insufficient labeled data in specific domains, such as medical image analysis. Third, it can improve target-task performance using models pre-trained on different datasets.
Finally, our approach improves remote heart rate estimation performance by extracting useful features from facial images.
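The underlying rPPG principle the abstract builds on can be illustrated with a minimal sketch: subtle periodic color changes in facial skin are averaged into a 1-D signal, and the heart rate is read off as the dominant frequency in the plausible pulse band. This is a generic illustration of the color-change signal the paper's method attends to, not the authors' attention-based model; the function name, synthetic video, and band limits are illustrative assumptions.

```python
import numpy as np

def estimate_heart_rate(frames, fps):
    """Estimate pulse rate (BPM) from the mean green-channel signal
    of a face-video frame sequence via a simple FFT peak search."""
    # Spatially average the green channel of each frame: (T, H, W, 3) -> (T,)
    signal = frames[:, :, :, 1].mean(axis=(1, 2))
    signal = signal - signal.mean()          # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    # Restrict to a plausible human pulse band (0.7-4 Hz, i.e. 42-240 BPM)
    band = (freqs >= 0.7) & (freqs <= 4.0)
    peak_freq = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_freq

# Synthetic "face video": the green channel pulses at 1.2 Hz (72 BPM)
fps, seconds = 30, 10
t = np.arange(fps * seconds) / fps
frames = np.zeros((len(t), 8, 8, 3))
frames[:, :, :, 1] = 0.5 + 0.01 * np.sin(2 * np.pi * 1.2 * t)[:, None, None]
print(round(estimate_heart_rate(frames, fps)))  # recovers 72 BPM
```

A deep rPPG model replaces the fixed spatial average with learned spatiotemporal features, which is where self-supervised pretraining on unlabeled facial video becomes useful.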


rPPG; self-supervised learning; facial video analysis





Published by INSIGHT - Indonesian Society for Knowledge and Human Development