According to studies of the Human Visual System (HVS), people naturally focus on a small region around the point of fixation, which they perceive at high resolution, while other regions are perceived at low resolution. Predicting these fixation regions is called saliency detection. Associate Professor Xu Mai, of the International Research Institute for Multidisciplinary Science and the School of Electronic and Information Engineering at Beihang University, has been working with his team on saliency detection models for video and images. Their results have been published in leading journals and conferences, such as IEEE Transactions on Image Processing, the International Conference on Computer Vision (ICCV) and the Conference on Computer Vision and Pattern Recognition (CVPR), and have been well received in the field of multimedia communication.
Research Background: “Congestion” on the Communication Road
Scientific research aims to create a better future, and its inspiration comes from real life. Watching video on smart devices has become an essential part of daily life. Consumers’ growing demand for HD video has focused the communication industry on one question: how to transmit massive amounts of HD video data more efficiently.
The bottleneck in video communication is that, with the development of ultra-HD and panoramic video, the amount of data transmitted over networks keeps growing while bandwidth remains limited. The network is like a road with limited space: as more and more cars use it, severe congestion follows. Efficient transmission under limited bandwidth has become a problem the communication industry urgently needs to solve.
Inspiration Source: Integration of Hot Issues
Xu Mai and his team’s research interests are video communication and image processing. Inspired by artificial intelligence and machine learning while studying compression coding, they began to study the user experience of watching videos, which offers a new approach to video compression. The human retina is comparable to an HD camera with on the order of a hundred million pixels. When we watch videos or images, however, the retina concentrates on the focus of attention: in any presented scene, only the region that attracts attention is seen clearly, and the surrounding regions are relatively blurred.
Inspired by the HVS, Xu Mai and his team aim to make the computer imitate it, i.e., to allocate more coding resources to areas with high visual attention. The compressed video can thus meet people’s demand for high quality while saving coding resources in other areas. Building on perceptual video coding, Dr. Xu developed a new research direction: building a model to predict the regions of visual focus when people watch videos or images.
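The idea of giving more coding resources to high-attention areas can be illustrated with a minimal sketch. It is not the team's method: the function name `allocate_qp`, the linear mapping, and the default values are assumptions made for illustration. It assumes a per-block saliency map with values in [0, 1] and a quantization parameter (QP) in the 0 to 51 range used by H.264/HEVC, where a lower QP means finer quantization and more bits.

```python
def allocate_qp(saliency, base_qp=32, max_offset=6, qp_min=0, qp_max=51):
    """Map per-block saliency in [0, 1] to a per-block QP (hypothetical helper).

    Salient blocks (s near 1) get a QP below base_qp, i.e. more bits;
    non-salient blocks (s near 0) get a QP above base_qp, i.e. fewer bits.
    """
    qp_map = []
    for row in saliency:
        qp_row = []
        for s in row:
            # Linear mapping: s = 1 -> -max_offset, s = 0.5 -> 0, s = 0 -> +max_offset
            offset = round(max_offset * (1.0 - 2.0 * s))
            # Clamp to the valid QP range
            qp_row.append(min(qp_max, max(qp_min, base_qp + offset)))
        qp_map.append(qp_row)
    return qp_map


# Example: a 2x2 grid of blocks with different saliency values
example = allocate_qp([[0.0, 0.5], [1.0, 0.25]])
print(example)
```

A real encoder would also have to keep the total bit rate on target; this sketch only shows the direction of the adjustment.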
Research Extension: Cooperation in a Young Team
Dr. Xu’s team is made up of young people full of energy, passion and creativity; the youngest member is a junior undergraduate. For Dr. Xu, what matters in scientific research is interest and competence, not age.
The research team worked together to uncover the patterns of human attention using machine learning. They invited many participants and recorded the regions the participants focused on while watching videos. Based on the collected data, they built machine learning models, using Support Vector Machines (SVM) and deep learning, to predict the regions of visual focus.
Their findings can be used not only to compress videos and images but also in many other fields, such as web design and typesetting. Interactive and immersive panoramic video will be the focus of their future research. They will also seek to cooperate with other research teams on interdisciplinary research.
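As a rough illustration of how recorded gaze positions can become training targets for such a model, the sketch below turns a list of fixation points into a normalized attention map. It is an assumption for illustration, not the team's pipeline: the function name `fixation_map`, the Gaussian spread `sigma`, and the sample coordinates are all hypothetical.

```python
import math


def fixation_map(fixations, width, height, sigma=2.0):
    """Build a fixation density map from (x, y) gaze points (hypothetical helper).

    Each fixation adds a 2D Gaussian bump centred on the gaze point; the
    result is scaled so that the most attended pixel has value 1.0.
    """
    heat = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                heat[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return heat


# Example: two participants looked at (2, 2), one looked at (7, 5)
heat = fixation_map([(2, 2), (2, 2), (7, 5)], width=10, height=8)
print(round(heat[5][7], 3))  # attention at the second, less visited location
```

A learned model, whether an SVM or a deep network, would then be trained to predict maps like this one from the video frames themselves.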
Reported by Qin Yuyao, Zhang Jinxing
Designed by Peng Xutan
Translated by Liu Xinrui