Venue: Room 500, Artificial Intelligence Institute, 5th Floor, Software Building, Minhang Campus, Shanghai Jiao Tong University
In this talk, I will present our recent results on visual tracking and video object segmentation.
The tracking-by-detection framework typically consists of two stages: drawing samples around the target object in the first stage, and classifying each sample as the target object or as background in the second stage. The performance of existing trackers that use deep classification networks is limited in two respects. First, the positive samples in each frame overlap heavily in space, so they fail to capture rich appearance variations. Second, there is an extreme class imbalance between positive and negative samples. Our VITAL algorithm aims to address these two problems via adversarial learning. To augment positive samples, we use a generative network to randomly generate masks, which are applied to adaptively drop out input features and thereby capture a variety of appearance changes. With the use of adversarial learning, our network identifies the mask that maintains the most robust features of the target objects over a long temporal span. In addition, to handle the class imbalance, we propose a high-order cost-sensitive loss that reduces the effect of easy negative samples and so facilitates training of the classification network. Extensive experiments on benchmark datasets demonstrate that the proposed tracker performs favorably against state-of-the-art approaches.
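To make the idea of a cost-sensitive loss concrete, the sketch below down-weights samples that the classifier already gets right. It is an illustration only, not the formulation used in VITAL: the abstract does not give the formula, so the focal-style weighting, the function name `cost_sensitive_loss`, and the `order` exponent are all assumptions.

```python
import math

def cost_sensitive_loss(probs, labels, order=1, eps=1e-12):
    """Mean cross-entropy with each sample weighted by its error raised to `order`.

    probs  -- predicted probability that each sample is the target (0..1)
    labels -- 1 for target (positive), 0 for background (negative)
    order  -- hypothetical exponent; order=0 recovers plain cross-entropy
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Probability the classifier assigns to the true class.
        p_true = p if y == 1 else 1.0 - p
        # Easy samples (p_true near 1) receive a weight near 0.
        weight = (1.0 - p_true) ** order
        total += -weight * math.log(max(p_true, eps))
    return total / len(probs)

# An easy negative (score 0.01) versus a hard negative (score 0.6).
easy_plain = cost_sensitive_loss([0.01], [0], order=0)
easy_weighted = cost_sensitive_loss([0.01], [0], order=2)
hard_plain = cost_sensitive_loss([0.6], [0], order=0)
hard_weighted = cost_sensitive_loss([0.6], [0], order=2)
```

Under this assumed weighting, the easy negative's loss shrinks by a much larger factor than the hard negative's, so the many trivially classified background samples contribute little to training.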
Online video object segmentation is a challenging task because it requires processing the image sequence both quickly and accurately. To segment a target object throughout a video, numerous CNN-based methods rely on heavy fine-tuning on the object mask in the first frame, which is too time-consuming for online applications. In the second part of the talk, we propose a fast and accurate video object segmentation algorithm that can start the segmentation process as soon as it receives the images. We first use a part-based tracking method to deal with challenging factors such as large deformation, occlusion, and cluttered background. Based on the tracked bounding boxes of the parts, we construct a region-of-interest segmentation network to generate part masks. Finally, a similarity-based scoring function is adopted to refine these object parts by comparing them to the visual information in the first frame. Our method performs favorably against state-of-the-art algorithms in terms of accuracy on the DAVIS benchmark dataset, while running much faster.
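The final refinement step can be pictured as scoring each tracked part against a first-frame reference and keeping only the parts that match well. The sketch below is one plausible reading, not the scoring function from the talk: the use of cosine similarity, the feature vectors, the `threshold` value, and the names `cosine_similarity` and `select_parts` are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def select_parts(part_features, reference_feature, threshold=0.5):
    """Return indices of parts whose feature resembles the first-frame reference."""
    kept = []
    for index, feature in enumerate(part_features):
        if cosine_similarity(feature, reference_feature) >= threshold:
            kept.append(index)
    return kept

# Toy 2-D features: only the first part points the same way as the reference.
reference = [1.0, 0.0]
parts = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
kept_parts = select_parts(parts, reference)
```

In this toy setup the masks of the retained parts would be merged into the object mask, while parts that drifted onto the background would be discarded.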
Ming-Hsuan Yang is a research scientist at Google and a professor in Electrical Engineering and Computer Science at the University of California, Merced. He received his PhD degree in Computer Science from the University of Illinois at Urbana-Champaign in 2000. He serves as an area chair for several conferences, including the IEEE Conference on Computer Vision and Pattern Recognition, IEEE International Conference on Computer Vision, European Conference on Computer Vision, Asian Conference on Computer Vision, and AAAI National Conference on Artificial Intelligence. He is a program co-chair for the IEEE International Conference on Computer Vision in 2019, and was a program co-chair for the Asian Conference on Computer Vision in 2014 and general co-chair for the Asian Conference on Computer Vision in 2016. He served as an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (2007 to 2011), and serves as an associate editor of the International Journal of Computer Vision, Computer Vision and Image Understanding, Image and Vision Computing, and Journal of Artificial Intelligence Research. Yang received the Google Faculty Award in 2009, the Distinguished Early Career Research Award from the UC Merced Senate in 2011, the Faculty Early Career Development (CAREER) award from the National Science Foundation in 2012, and the Distinguished Research Award from the UC Merced Senate in 2015. He is an IEEE Fellow.