AI paper index
From Sparse to Dense: Label-Efficient Weakly Supervised Segmentation for Images and Videos
One-line summary
A thesis that proposes a label-efficient "sparse-to-dense" paradigm for propagating weak supervision, such as image-level tags or scribbles, into dense segmentation predictions for images and videos.
Engineering notes
Engineering notes will be added by the aipentium editorial team.
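Chinese explanation
Chinese explanation pending: this site will prioritize adding Chinese explanations for high-value papers on topics such as large language models, generative AI, ChatGPT-related technology, computer vision, and deep learning.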
Original abstract
Obtaining high-quality annotated data has become a primary bottleneck for training deep learning models, particularly for dense prediction tasks like semantic segmentation and video salient object segmentation. The demand for meticulous, pixel-level labeling makes fully-supervised approaches prohibitively expensive and limits their scalability. This thesis directly confronts this issue by proposing and systematically investigating a label-efficient "sparse-to-dense" paradigm. The core objective is to develop innovative frameworks that can learn from sparse, inexpensive weak supervision, such as image-level tags or scribbles, and intelligently propagate this minimal information to generate complete, pixel-accurate dense predictions. The major part of this thesis focuses on the "sparse-to-dense" challenge within Weakly Supervised Semantic Segmentation (WSSS) for images, where models learn from only image-level tags. Dominant approaches rely on Class Activation Maps (CAMs) as initial object seeds. However, CAMs inherently activate only the most discriminative object regions, leading to incomplete pseudo-labels. This gives rise to the central problem of activation densification: the critical process of expanding these sparse activations to cover the full extent of an object. Our research addresses this challenge through a progressive, multi-faceted strategy. Our exploration begins with a foundational framework (C3) designed to rectify the intrinsic flaws of initial CAMs. We propose a synergistic mutual calibration method. This method leverages the discrepancies between the two maps to generate local prototypes, which are simultaneously utilized to complete missing object parts and suppress background noise, yielding a significantly more accurate and complete initial seed. This foundational work immediately reveals the limitation of single-prototype instability and its inability to capture diverse features.
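To make the idea of activation densification concrete, the following is a minimal sketch, not the thesis's C3 method. It assumes the standard CAM definition (a per-pixel weighted sum of feature channels using the classifier weights of one class) and a generic single-prototype expansion: average the features of high-confidence seed pixels, then score every pixel by cosine similarity to that prototype. The function names, the toy features, and the 0.5 seed threshold are all hypothetical choices for illustration.

```python
import math

def class_activation_map(features, class_weights):
    """CAM for one class: weighted channel sum per pixel, ReLU, max-normalised.

    features: list of pixels, each a list of C channel activations
    class_weights: list of C classifier weights for the target class
    """
    raw = [max(0.0, sum(w * f for w, f in zip(class_weights, px))) for px in features]
    peak = max(raw)
    return [v / peak if peak > 0 else 0.0 for v in raw]

def masked_average_prototype(features, cam, threshold):
    """Average the feature vectors of pixels whose CAM score is >= threshold."""
    seeds = [px for px, score in zip(features, cam) if score >= threshold]
    if not seeds:
        raise ValueError("no seed pixels above threshold")
    channels = len(seeds[0])
    return [sum(px[c] for px in seeds) / len(seeds) for c in range(channels)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def densify(features, prototype):
    """Score every pixel by its (non-negative) similarity to the prototype."""
    return [max(0.0, cosine(px, prototype)) for px in features]

# Toy example: pixel 0 is a discriminative object part, pixel 1 is a weaker
# part of the same object (same feature direction, lower magnitude), and
# pixel 2 is background.
features = [[4.0, 2.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.0, 3.0]]
cam = class_activation_map(features, [1.0, 0.5, 0.0])
prototype = masked_average_prototype(features, cam, threshold=0.5)
dense = densify(features, prototype)
print(cam)    # pixel 1 is only weakly activated in the raw CAM
print(dense)  # pixel 1 now scores close to pixel 0; background stays at 0
```

In this toy case only pixel 0 passes the seed threshold, so the prototype is built from a single pixel. That mirrors the single-prototype instability the abstract mentions: one noisy seed region determines the whole expanded map.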
Building on this foundation, we then develop architecture-specific strategies to address deeper limitations of the prototype-based approach. For Convolutional Neural Networks (CNNs) (C4), the use of a single prototype proves insufficient to capture significant intra-class variation. In response, we introduce an Optimal Transport (OT)-assisted framework that effectively mines multiple, diverse prototypes, thereby yielding more comprehensive activation maps. However, while C4 solves the intra-class diversity problem, the inherent local context constraint of the CNN architecture becomes the new bottleneck, fundamentally limiting the quality of globally-aware feature extraction needed for optimal multi-prototype assignment. This limitation necessitates a shift in architectural focus from CNNs to Vision Transformers. In the ViT architecture (C5), a new challenge arises where the token gap between the discriminative class token and varied patch tokens leads to incomplete activation. To bridge this gap, we leverage a tailored application of Optimal Transport, where the algorithm is tasked with learning a comprehensive proxy to ensure complete object activation. Despite these architectural advances (C3-C5), the overall WSSS performance hit a ceiling. This is fundamentally limited by the sparse supervision of class labels, which cannot provide precise boundary information. To break this ceiling, our WSSS exploration culminates in a universal enhancement strategy (C6) that transcends architecture-specific designs by leveraging Visual Foundation Models (VFMs). We resolve the conflict between the discriminative nature of initial CAMs and the VFM's need for holistic prompts. By modeling the CAM foreground as a graph and employing spectral analysis, our method intelligently selects a diverse set of prompt points that represent the overall structure of the foreground object. This guides the VFM to generate superior, boundary-aware supervision that can universally enhance any WSSS pipeline.
Subsequently, the thesis extends the "sparse-to-dense" paradigm to the dynamic domain of video, addressing Weakly Supervised Video Salient Object Segmentation (WSVSOS) (C7) from sparse scribble annotations. In this specific context, the challenge evolves into spatio-temporal densification, which requires both full-object coverage within each frame and temporal consistency across the video sequences. To this end, we introduce the core CFMR framework, which utilizes Spatio-Temporal co-regulation. This framework is subsequently enhanced (CFMR+) by integrating the VFM guidance strategy from C6, which provides high-quality spatial priors essential for ensuring both high spatial detail and inter-frame consistency. By enforcing both intra-frame and cross-frame consistency between these two streams, our method effectively propagates the sparse initial signal to generate dense and temporally coherent saliency masks.
In conclusion, this thesis presents a comprehensive and cohesive investigation into label-efficient segmentation tasks, offering a suite of innovative and effective strategies for the "sparse-to-dense" challenge. Our work establishes a progressive roadmap that systematically overcomes architectural and semantic limitations, culminating in highly robust and generalized densification pipelines. Our extensive experimental results on standard benchmarks validate the state-of-the-art performance of these methods, providing robust solutions to a long-standing problem in the computer vision field.
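The abstract says Optimal Transport is used to mine multiple prototypes, but it does not give the formulation. The sketch below is therefore an assumption rather than the thesis's C4 or C5 method: it shows one common way to assign features to several prototypes, namely entropy-regularised OT solved with Sinkhorn-Knopp iterations under uniform marginals. The function name `sinkhorn_assign`, the cost matrix, and the values of `epsilon` and `iterations` are hypothetical.

```python
import math

def sinkhorn_assign(cost, epsilon=0.1, iterations=200):
    """Entropy-regularised OT plan between n features and k prototypes.

    cost[i][j] is the cost of assigning feature i to prototype j, for
    example 1 - cosine similarity. Both marginals are assumed uniform:
    each feature sends out mass 1/n and each prototype receives mass 1/k,
    which discourages all features collapsing onto one prototype.
    """
    n, k = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(iterations):
        # Rescale rows so that each feature sends out mass 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        # Rescale columns so that each prototype receives mass 1/k.
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

# Toy example: four features and two prototypes. Features 0 and 1 are cheap
# to assign to prototype 0; features 2 and 3 are cheap to assign to prototype 1.
cost = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
plan = sinkhorn_assign(cost)
assignment = [max(range(len(row)), key=lambda j: row[j]) for row in plan]
print(assignment)
```

The balanced column constraint is the design choice that matters here. A plain nearest-prototype rule can leave some prototypes unused, whereas the transport plan forces every prototype to account for an equal share of the features.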
Links and sources