Computer Vision and Pattern Recognition 132
☆ Learning from Streaming Video with Orthogonal Gradients CVPR2025
Tengda Han, Dilara Gokay, Joseph Heyward, Chuhan Zhang, Daniel Zoran, Viorica Pătrăucean, João Carreira, Dima Damen, Andrew Zisserman
We address the challenge of representation learning from a continuous stream
of video as input, in a self-supervised manner. This differs from the standard
approaches to video learning where videos are chopped and shuffled during
training in order to create a non-redundant batch that satisfies the
independently and identically distributed (IID) sample assumption expected by
conventional training paradigms. When videos are only available as a continuous
stream of input, the IID assumption is evidently broken, leading to poor
performance. We demonstrate the drop in performance when moving from shuffled
to sequential learning on three tasks: the one-video representation learning
method DoRA, standard VideoMAE on multi-video datasets, and the task of future
video prediction. To address this drop, we propose a geometric modification to
standard optimizers, to decorrelate batches by utilising orthogonal gradients
during training. The proposed modification can be applied to any optimizer --
we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our
proposed orthogonal optimizer alleviates the drop in representation learning
performance for models trained on streaming video, as evaluated on downstream
tasks. In all three scenarios (DoRA, VideoMAE, future prediction), our
orthogonal optimizer outperforms the strong AdamW baseline.
comment: CVPR2025
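The core idea above -- decorrelating consecutive, temporally redundant batches by making each gradient orthogonal to its predecessor -- can be sketched as a single Gram-Schmidt projection before the optimizer step. This is a minimal illustration of the principle, not the authors' exact algorithm; which past gradients to orthogonalize against and how to fold this into AdamW are assumptions here.

```python
import numpy as np

def orthogonal_sgd_step(params, grad, prev_grad, lr=0.01):
    """One SGD step where the current gradient is first projected to be
    orthogonal to the previous batch's gradient, decorrelating consecutive
    updates from a temporally correlated stream. Sketch only."""
    denom = np.dot(prev_grad, prev_grad)
    if denom > 1e-12:
        # Remove the component of `grad` along `prev_grad` (Gram-Schmidt).
        grad = grad - (np.dot(grad, prev_grad) / denom) * prev_grad
    return params - lr * grad
```

After the projection, the applied update carries no component along the previous gradient direction, which is the sense in which successive batches are decorrelated.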
☆ Diffusion-Guided Gaussian Splatting for Large-Scale Unconstrained 3D Reconstruction and Novel View Synthesis WACV
Niluthpol Chowdhury Mithun, Tuan Pham, Qiao Wang, Ben Southall, Kshitij Minhas, Bogdan Matei, Stephan Mandt, Supun Samarasekera, Rakesh Kumar
Recent advancements in 3D Gaussian Splatting (3DGS) and Neural Radiance
Fields (NeRF) have achieved impressive results in real-time 3D reconstruction
and novel view synthesis. However, these methods struggle in large-scale,
unconstrained environments where sparse and uneven input coverage, transient
occlusions, appearance variability, and inconsistent camera settings lead to
degraded quality. We propose GS-Diff, a novel 3DGS framework guided by a
multi-view diffusion model to address these limitations. By generating
pseudo-observations conditioned on multi-view inputs, our method transforms
under-constrained 3D reconstruction problems into well-posed ones, enabling
robust optimization even with sparse data. GS-Diff further integrates several
enhancements, including appearance embedding, monocular depth priors, dynamic
object modeling, anisotropy regularization, and advanced rasterization
techniques, to tackle geometric and photometric challenges in real-world
settings. Experiments on four benchmarks demonstrate that GS-Diff consistently
outperforms state-of-the-art baselines by significant margins.
comment: WACV ULTRRA Workshop 2025
☆ GaussianLSS -- Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting CVPR 2025
Bird's-eye view (BEV) perception has gained significant attention because it
provides a unified representation to fuse multiple view images and enables a
wide range of downstream autonomous driving tasks, such as forecasting and
planning. Recent state-of-the-art models utilize projection-based methods which
formulate BEV perception as query learning to bypass explicit depth estimation.
While we observe promising advancements in this paradigm, these methods still
fall short of real-world applications because they lack uncertainty modeling
and have expensive computational requirements. In this work, we introduce GaussianLSS, a
novel uncertainty-aware BEV perception framework that revisits
unprojection-based methods, specifically the Lift-Splat-Shoot (LSS) paradigm,
and enhances them with depth uncertainty modeling. GaussianLSS represents
spatial dispersion by learning a soft depth mean and computing the variance of
the depth distribution, which implicitly captures object extents. We then
transform the depth distribution into 3D Gaussians and rasterize them to
construct uncertainty-aware BEV features. We evaluate GaussianLSS on the
nuScenes dataset, achieving state-of-the-art performance compared to
unprojection-based methods. In particular, it provides significant advantages
in speed, running 2.5x faster, and in memory efficiency, using 0.3x less memory
compared to projection-based methods, while achieving competitive performance
with only a 0.4% IoU difference.
comment: Accepted to CVPR 2025
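The soft depth mean and variance described above can be illustrated with a simple computation over discrete depth bins, as in LSS-style lifting: a softmax over per-pixel depth logits gives a distribution whose mean is the soft depth estimate and whose variance captures dispersion along the ray. The discretization here is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def depth_mean_and_variance(logits, depth_bins):
    """Soft depth mean and variance from per-pixel logits over depth bins.
    The variance implicitly reflects uncertainty / object extent along the
    viewing ray. Sketch under an assumed discretization."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                        # softmax over depth bins
    mean = (p * depth_bins).sum()       # soft depth estimate
    var = (p * (depth_bins - mean) ** 2).sum()
    return mean, var
```

A peaked distribution yields a small variance (confident depth), while a flat one yields a large variance, which is what makes the resulting BEV features uncertainty-aware.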
☆ VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Recovering 3D scenes from sparse views is a challenging task due to its
inherent ill-posed problem. Conventional methods have developed specialized
solutions (e.g., geometry regularization or feed-forward deterministic model)
to mitigate the issue. However, they still suffer performance degradation when
input views have minimal overlap and provide insufficient visual information.
Fortunately, recent video generative models show promise in addressing this
challenge as they are capable of generating video clips with plausible 3D
structures. Powered by large pretrained video diffusion models, some pioneering
research has started to explore the potential of video generative priors to
create 3D scenes from sparse views. Despite impressive improvements, these
methods are limited by slow inference and a lack of 3D constraints, leading to
inefficiencies and reconstruction artifacts that do not align with real-world
geometry. In this paper, we propose VideoScene to distill the video diffusion
model to generate 3D scenes in one step, aiming to build an efficient and
effective tool to bridge the gap from video to 3D. Specifically, we design a
3D-aware leap flow distillation strategy to leap over time-consuming redundant
information and train a dynamic denoising policy network to adaptively
determine the optimal leap timestep during inference. Extensive experiments
demonstrate that our VideoScene achieves faster and superior 3D scene
generation results than previous video diffusion models, highlighting its
potential as an efficient tool for future video-to-3D applications. Project
Page: https://hanyang-21.github.io/VideoScene
comment: Project Page: https://hanyang-21.github.io/VideoScene
☆ Scene-Centric Unsupervised Panoptic Segmentation CVPR 2025
Unsupervised panoptic segmentation aims to partition an image into
semantically meaningful regions and distinct object instances without training
on manually annotated data. In contrast to prior work on unsupervised panoptic
scene understanding, we eliminate the need for object-centric training data,
enabling the unsupervised understanding of complex scenes. To that end, we
present the first unsupervised panoptic method that directly trains on
scene-centric imagery. In particular, we propose an approach to obtain
high-resolution panoptic pseudo labels on complex scene-centric data, combining
visual representations, depth, and motion cues. Utilizing both pseudo-label
training and a panoptic self-training strategy yields a novel approach that
accurately predicts panoptic segmentation of complex scenes without requiring
any human annotations. Our approach significantly improves panoptic quality,
e.g., surpassing the recent state of the art in unsupervised panoptic
segmentation on Cityscapes by 9.4 percentage points in PQ.
comment: To appear at CVPR 2025. Christoph Reich and Oliver Hahn - both
authors contributed equally. Code: https://github.com/visinf/cups Project
page: https://visinf.github.io/cups/
☆ Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Jing Liu, Wenxuan Wang, Yisi Zhang, Yepeng Tang, Xingjian He, Longteng Guo, Tongtian Yue, Xinlong Wang
Referring expression segmentation (RES) aims at segmenting the entities'
masks that match the descriptive language expression. While traditional RES
methods primarily address object-level grounding, real-world scenarios demand a
more versatile framework that can handle multiple levels of target granularity,
such as multi-object, single-object, or part-level references. This introduces
great challenges due to the diverse and nuanced ways users describe targets.
However, existing datasets and models mainly focus on designing grounding
specialists for object-level target localization, lacking the necessary data
resources and unified frameworks for the more practical multi-grained RES. In
this paper, we take a step further towards visual granularity unified RES task.
To overcome the limitation of data scarcity, we introduce a new
multi-granularity referring expression segmentation (MRES) task, alongside the
RefCOCOm benchmark, which includes part-level annotations for advancing
finer-grained visual understanding. In addition, we create MRES-32M, the
largest visual grounding dataset, comprising over 32.2M masks and captions
across 1M images, specifically designed for part-level vision-language
grounding. To tackle the challenges of multi-granularity RES, we propose
UniRES++, a unified multimodal large language model that integrates
object-level and part-level RES tasks. UniRES++ incorporates targeted designs
for fine-grained visual feature exploration. With the joint model architecture
and parameters, UniRES++ achieves state-of-the-art performance across multiple
benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and
RefCOCO, RefCOCO+, RefCOCOg for classic RES. To foster future research into
multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset and
model UniRES++ will be publicly available at
https://github.com/Rubics-Xuan/MRES.
☆ Deep Representation Learning for Unsupervised Clustering of Myocardial Fiber Trajectories in Cardiac Diffusion Tensor Imaging MICCAI 2025
Understanding the complex myocardial architecture is critical for diagnosing
and treating heart disease. However, existing methods often struggle to
accurately capture this intricate structure from Diffusion Tensor Imaging (DTI)
data, particularly due to the lack of ground truth labels and the ambiguous,
intertwined nature of fiber trajectories. We present a novel deep learning
framework for unsupervised clustering of myocardial fibers, providing a
data-driven approach to identifying distinct fiber bundles. We uniquely combine
a Bidirectional Long Short-Term Memory network to capture local sequential
information along fibers, with a Transformer autoencoder to learn global shape
features, along with pointwise incorporation of essential anatomical context.
Clustering these representations using a density-based algorithm identifies 33
to 62 robust clusters, successfully capturing the subtle distinctions in fiber
trajectories with varying levels of granularity. Our framework offers a new,
flexible, and quantitative way to analyze myocardial structure, achieving a
level of delineation that, to our knowledge, has not been previously achieved,
with potential applications in improving surgical planning, characterizing
disease-related remodeling, and ultimately, advancing personalized cardiac
care.
comment: 10 pages, 5 figures. Submitted to MICCAI 2025 (under review)
☆ Image Difference Grounding with Natural Language
Visual grounding (VG) typically focuses on locating regions of interest
within an image using natural language, and most existing VG methods are
limited to single-image interpretations. This limits their applicability in
real-world scenarios like automatic surveillance, where detecting subtle but
meaningful visual differences across multiple images is crucial. Moreover,
previous work on image difference understanding (IDU) has either focused on
detecting all change regions without cross-modal text guidance, or on providing
coarse-grained descriptions of differences. Therefore, to push towards
finer-grained vision-language perception, we propose Image Difference Grounding
(IDG), a task designed to precisely localize visual differences based on user
instructions. We introduce DiffGround, a large-scale and high-quality dataset
for IDG, containing image pairs with diverse visual variations along with
instructions querying fine-grained differences. We also present a baseline
model for IDG, DiffTracker, which effectively integrates feature differential
enhancement and common suppression to precisely locate differences. Experiments
on the DiffGround dataset highlight the importance of our IDG dataset in
enabling finer-grained IDU. To foster future research, both DiffGround data and
DiffTracker model will be publicly released.
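The "feature differential enhancement and common suppression" named above can be illustrated with a toy per-location operation on paired feature maps: amplify where the two images' features disagree and down-weight shared activations. The specific operators and the 0.5 weighting below are illustrative assumptions, not DiffTracker's actual modules.

```python
import numpy as np

def difference_and_common(f1, f2, suppress=0.5):
    """Toy sketch: emphasize feature differences between two aligned
    feature maps while suppressing their common (shared) content.
    Operators chosen for illustration only."""
    diff = np.abs(f1 - f2)            # differential enhancement
    common = np.minimum(f1, f2)       # shared activation to suppress
    return diff - suppress * common
```

Locations where the two images agree score low, and genuine differences stand out, which is the intuition behind grounding differences rather than full change regions.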
☆ End-to-End Driving with Online Trajectory Evaluation via BEV World Model
End-to-end autonomous driving has achieved remarkable progress by integrating
perception, prediction, and planning into a fully differentiable framework.
Yet, to fully realize its potential, an effective online trajectory evaluation
is indispensable to ensure safety. By forecasting the future outcomes of a
given trajectory, trajectory evaluation becomes much more effective. This goal
can be achieved by employing a world model to capture environmental dynamics
and predict future states. Therefore, we propose an end-to-end driving
framework WoTE, which leverages a BEV World model to predict future BEV states
for Trajectory Evaluation. The proposed BEV world model is latency-efficient
compared to image-level world models and can be seamlessly supervised using
off-the-shelf BEV-space traffic simulators. We validate our framework on both
the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the
CARLA simulator, achieving state-of-the-art performance. Code is released at
https://github.com/liyingyanUCAS/WoTE.
☆ ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, Hang Xu
We present ILLUME+ that leverages dual visual tokenization and a diffusion
decoder to improve both deep semantic understanding and high-fidelity image
generation. Existing unified models have struggled to simultaneously handle the
three fundamental capabilities in a unified model: understanding, generation,
and editing. Models like Chameleon and EMU3 utilize VQGAN for image
discretization, but due to the lack of deep semantic interaction they lag behind
specialist models like LLaVA in visual understanding tasks. To mitigate this,
LaViT and ILLUME employ semantic encoders for tokenization, but they struggle
with image editing due to poor texture preservation. Meanwhile, Janus series
decouples the input and output image representation, limiting their abilities
to seamlessly handle interleaved image-text understanding and generation. In
contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which
preserves both fine-grained textures and text-aligned semantics while enabling
a coarse-to-fine image representation strategy for multimodal understanding and
generation. Additionally, we employ a diffusion model as the image detokenizer
for enhanced generation quality and efficient super-resolution. ILLUME+ follows
a continuous-input, discrete-output scheme within the unified MLLM and adopts a
progressive training procedure that supports dynamic resolution across the
vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible
and efficient context-aware image editing and generation across diverse tasks.
ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs
and specialized models across multimodal understanding, generation, and editing
benchmarks. With its strong performance, ILLUME+ provides a scalable and
versatile foundation for future multimodal applications. Project Page:
https://illume-unified-mllm.github.io/.
☆ Equivariant Spherical CNNs for Accurate Fiber Orientation Distribution Estimation in Neonatal Diffusion MRI with Reduced Acquisition Time
Early and accurate assessment of brain microstructure using diffusion
Magnetic Resonance Imaging (dMRI) is crucial for identifying neurodevelopmental
disorders in neonates, but remains challenging due to low signal-to-noise ratio
(SNR), motion artifacts, and ongoing myelination. In this study, we propose a
rotationally equivariant Spherical Convolutional Neural Network (sCNN)
framework tailored for neonatal dMRI. We predict the Fiber Orientation
Distribution (FOD) from multi-shell dMRI signals acquired with a reduced set of
gradient directions (30% of the full protocol), enabling faster and more
cost-effective acquisitions. We train and evaluate the performance of our sCNN
using real data from 43 neonatal dMRI datasets provided by the Developing Human
Connectome Project (dHCP). Our results demonstrate that the sCNN achieves
significantly lower mean squared error (MSE) and higher angular correlation
coefficient (ACC) compared to a Multi-Layer Perceptron (MLP) baseline,
indicating improved accuracy in FOD estimation. Furthermore, tractography
results based on the sCNN-predicted FODs show improved anatomical plausibility,
coverage, and coherence compared to those from the MLP. These findings
highlight that sCNNs, with their inherent rotational equivariance, offer a
promising approach for accurate and clinically efficient dMRI analysis, paving
the way for improved diagnostic capabilities and characterization of early
brain development.
☆ FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs
As a pioneering vision-language model, CLIP (Contrastive Language-Image
Pre-training) has achieved significant success across various domains and a
wide range of downstream vision-language tasks. However, the text encoders in
popular CLIP models are limited to processing only 77 text tokens, which
constrains their ability to effectively handle longer, detail-rich captions.
Additionally, CLIP models often struggle to effectively capture detailed visual
and textual information, which hampers their performance on tasks that require
fine-grained analysis. To address these limitations, we present a novel
approach, \textbf{FineLIP}, that extends the capabilities of CLIP. FineLIP
enhances cross-modal text-image mapping by incorporating \textbf{Fine}-grained
alignment with \textbf{L}onger text input within the CL\textbf{IP}-style
framework. FineLIP first extends the positional embeddings to handle longer
text, followed by the dynamic aggregation of local image and text tokens. The
aggregated results are then used to enforce fine-grained token-to-token
cross-modal alignment. We validate our model on datasets with long, detailed
captions across two tasks: zero-shot cross-modal retrieval and text-to-image
generation. Quantitative and qualitative experimental results demonstrate the
effectiveness of FineLIP, outperforming existing state-of-the-art approaches.
Furthermore, comprehensive ablation studies validate the benefits of key design
elements within FineLIP.
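One common way to "extend the positional embeddings to handle longer text," as described above, is to linearly interpolate the pretrained position table (e.g. CLIP's 77 text positions) up to the new sequence length before fine-tuning. This interpolation scheme is an assumption for illustration; FineLIP's exact extension may differ.

```python
import numpy as np

def extend_positional_embeddings(pos_emb, new_len):
    """Linearly interpolate a pretrained positional embedding table
    (old_len x dim) to new_len positions, preserving the endpoints.
    Sketch of a standard context-extension trick, not FineLIP's code."""
    old_len, dim = pos_emb.shape
    old_pos = np.linspace(0.0, 1.0, old_len)
    new_pos = np.linspace(0.0, 1.0, new_len)
    return np.stack(
        [np.interp(new_pos, old_pos, pos_emb[:, d]) for d in range(dim)],
        axis=1,
    )
```

Because the interpolation preserves the learned embedding trajectory, the extended model starts close to the pretrained one and can then be fine-tuned on long captions.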
☆ Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
The rapid development of Large Multimodal Models (LMMs) for 2D images and
videos has spurred efforts to adapt these models for interpreting 3D scenes.
However, the absence of large-scale 3D vision-language datasets has posed a
significant obstacle. To address this issue, typical approaches focus on
injecting 3D awareness into 2D LMMs by designing 3D input-level scene
representations. This work provides a new perspective. We introduce
reconstructive visual instruction tuning with 3D-awareness (Ross3D), which
integrates 3D-aware visual supervision into the training procedure.
Specifically, it incorporates cross-view and global-view reconstruction. The
former requires reconstructing masked views by aggregating overlapping
information from other views. The latter aims to aggregate information from all
available views to recover Bird's-Eye-View images, contributing to a
comprehensive overview of the entire scene. Empirically, Ross3D achieves
state-of-the-art performance across various 3D scene understanding benchmarks.
More importantly, our semi-supervised experiments demonstrate significant
potential in leveraging large amounts of unlabeled 3D vision-only data.
☆ Is Temporal Prompting All We Need For Limited Labeled Action Recognition? CVPR
Video understanding has shown remarkable improvements in recent years,
largely dependent on the availability of large scaled labeled datasets. Recent
advancements in visual-language models, especially based on contrastive
pretraining, have shown remarkable generalization in zero-shot tasks, helping
to overcome this dependence on labeled datasets. Adaptations of such models for
videos, typically involve modifying the architecture of vision-language models
to cater to video data. However, this is not trivial, since such adaptations
are mostly computationally intensive and struggle with temporal modeling. We
present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting
for temporal adaptation without modifying the core CLIP architecture. This
preserves its generalization abilities. TP-CLIP efficiently integrates into the
CLIP architecture, leveraging its pre-trained capabilities for video data.
Extensive experiments across various datasets demonstrate its efficacy in
zero-shot and few-shot learning, outperforming existing approaches with fewer
parameters and greater computational efficiency. In particular, we use just 1/3
the GFLOPs and 1/28 the tuneable parameters of the recent state of the art, yet
still outperform it by up to 15.8% depending on the task
and dataset.
comment: Accepted in CVPR-W 2025
☆ GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning
Yanzhou Su, Tianbin Li, Jiyao Liu, Chenglong Ma, Junzhi Ning, Cheng Tang, Sibo Ju, Jin Ye, Pengcheng Chen, Ming Hu, Shixiang Tang, Lihao Liu, Bin Fu, Wenqi Shao, Xiaowei Hu, Xiangwen Liao, Yuanfeng Ji, Junjun He
Recent advances in general medical AI have made significant strides, but
existing models often lack the reasoning capabilities needed for complex
medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical
reasoning model enhanced by reinforcement learning (RL) to improve its
reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes
decision-making, significantly boosting diagnostic accuracy and clinical
support. We also develop a reasoning data synthesis method, generating
step-by-step reasoning data via rejection sampling, which further enhances the
model's generalization. Experimental results show that after RL training,
GMAI-VL-R1 excels in tasks such as medical image diagnosis and visual question
answering. While the model demonstrates basic memorization with supervised
fine-tuning, RL is crucial for true generalization. Our work establishes new
evaluation benchmarks and paves the way for future advancements in medical
reasoning models. Code, data, and model will be released at
\href{https://github.com/uni-medical/GMAI-VL-R1}{this link}.
☆ TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables
Humans continuously make new discoveries, and understanding the temporal sequence
of events leading to these breakthroughs is essential for advancing science and
society. This ability to reason over time allows us to identify future steps
and understand the effects of financial and political decisions on our lives.
However, large language models (LLMs) are typically trained on static datasets,
limiting their ability to perform effective temporal reasoning. To assess the
temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES
dataset, which comprises 3,971 questions derived from over 14,000 tables,
spanning 1,238 entities across multiple time periods. We introduce a
template-based question-generation pipeline that harnesses LLMs to refine both
templates and questions. Additionally, we establish baseline results using
state-of-the-art LLMs to create a benchmark. We also introduce novel modeling
strategies centered around task decomposition, enhancing LLM performance.
comment: 19 Pages. 21 Tables, 1 figure
☆ A Diffusion-Based Framework for Occluded Object Movement
Zheng-Peng Duan, Jiawei Zhang, Siyu Liu, Zheng Lin, Chun-Le Guo, Dongqing Zou, Jimmy Ren, Chongyi Li
Seamlessly moving objects within a scene is a common requirement for image
editing, but it is still a challenge for existing editing methods, especially
for real-world images, where occlusions further increase the difficulty: the
occluded portion needs to be completed before the movement can proceed. To
leverage the real-world knowledge
embedded in the pre-trained diffusion models, we propose a Diffusion-based
framework specifically designed for Occluded Object Movement, named DiffOOM.
The proposed DiffOOM consists of two parallel branches that perform object
de-occlusion and movement simultaneously. The de-occlusion branch utilizes a
background color-fill strategy and a continuously updated object mask to focus
the diffusion process on completing the obscured portion of the target object.
Concurrently, the movement branch employs latent optimization to place the
completed object in the target location and adopts local text-conditioned
guidance to integrate the object into new surroundings appropriately. Extensive
evaluations demonstrate the superior performance of our method, which is
further validated by a comprehensive user study.
☆ CoMatcher: Multi-View Collaborative Feature Matching CVPR 2025
This paper proposes a multi-view collaborative matching strategy for reliable
track construction in complex scenarios. We observe that the pairwise matching
paradigms applied to image set matching often result in ambiguous estimation
when the selected independent pairs exhibit significant occlusions or extreme
viewpoint changes. This challenge primarily stems from the inherent uncertainty
in interpreting intricate 3D structures based on limited two-view observations,
as the 3D-to-2D projection leads to significant information loss. To address
this, we introduce CoMatcher, a deep multi-view matcher to (i) leverage
complementary context cues from different views to form a holistic 3D scene
understanding and (ii) utilize cross-view projection consistency to infer a
reliable global solution. Building on CoMatcher, we develop a groupwise
framework that fully exploits cross-view relationships for large-scale matching
tasks. Extensive experiments on various complex scenarios demonstrate the
superiority of our method over the mainstream two-view matching paradigm.
comment: 15 pages, 7 figures, to be published in CVPR 2025
☆ BOGausS: Better Optimized Gaussian Splatting
3D Gaussian Splatting (3DGS) proposes an efficient solution for novel view
synthesis. Its framework provides fast and high-fidelity rendering. Although
less complex than other solutions such as Neural Radiance Fields (NeRF), there
are still challenges in building smaller models without sacrificing quality.
In this study, we perform a careful analysis of the 3DGS training process and
propose a new optimization methodology. Our Better Optimized Gaussian Splatting
(BOGausS) solution is able to generate models up to ten times lighter than the
original 3DGS with no quality degradation, thus significantly boosting the
performance of Gaussian Splatting compared to the state of the art.
☆ Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images
Artificial Intelligence (AI) in skin disease diagnosis has improved
significantly, but a major concern is that these models frequently show biased
performance across subgroups, especially regarding sensitive attributes such as
skin color. To address these issues, we propose a novel generative AI-based
framework, namely, Dermatology Diffusion Transformer (DermDiT), which leverages
text prompts generated via Vision Language Models and multimodal text-image
learning to generate new dermoscopic images. We utilize large vision language
models to generate accurate and proper prompts for each dermoscopic image which
helps to generate synthetic images to improve the representation of
underrepresented groups (patient, disease, etc.) in highly imbalanced datasets
for clinical diagnoses. Our extensive experiments show that large vision
language models provide much more insightful representations, enabling DermDiT
to generate high-quality images. Our code is available at
https://github.com/Munia03/DermDiT
comment: Paper accepted at International Symposium on Biomedical Imaging (ISBI
2025)
☆ Implicit Bias Injection Attacks against Text-to-Image Diffusion Models CVPR 2025
The proliferation of text-to-image diffusion models (T2I DMs) has led to an
increased presence of AI-generated images in daily life. However, biased T2I
models can generate content with specific tendencies, potentially influencing
people's perceptions. Intentional exploitation of these biases risks conveying
misleading information to the public. Current research on bias primarily
addresses explicit biases with recognizable visual patterns, such as skin color
and gender. This paper introduces a novel form of implicit bias that lacks
explicit visual features but can manifest in diverse ways across various
semantic contexts. This subtle and versatile nature makes this bias challenging
to detect, easy to propagate, and adaptable to a wide range of scenarios. We
further propose an implicit bias injection attack framework (IBI-Attacks)
against T2I diffusion models by precomputing a general bias direction in the
prompt embedding space and adaptively adjusting it based on different inputs.
Our attack module can be seamlessly integrated into pre-trained diffusion
models in a plug-and-play manner without direct manipulation of user input or
model retraining. Extensive experiments validate the effectiveness of our
scheme in introducing bias through subtle and diverse modifications while
preserving the original semantics. The strong concealment and transferability
of our attack across various scenarios further underscore the significance of
our approach. Code is available at https://github.com/Hannah1102/IBI-attacks.
comment: Accept to CVPR 2025
☆ Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning
Enhancing the spatial reasoning capabilities of Multi-modal Large Language
Models (MLLMs) for video understanding is crucial yet challenging. We present
Spatial-R1, a targeted approach involving two key contributions: the curation
of SR, a new video spatial reasoning dataset from ScanNet with automatically
generated QA pairs across seven task types, and the application of
Task-Specific Group Relative Policy Optimization (GRPO) for fine-tuning. By
training the Qwen2.5-VL-7B-Instruct model on SR using GRPO, Spatial-R1
significantly advances performance on the VSI-Bench benchmark, achieving a
7.4\% gain over the baseline and outperforming strong contemporary models. This
work validates the effectiveness of specialized data curation and optimization
techniques for improving complex spatial reasoning in video MLLMs.
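The GRPO fine-tuning referenced above relies on group-relative advantages: rewards for a group of sampled responses to the same prompt are normalized by the group's own mean and standard deviation, removing the need for a learned value function. Below is a minimal sketch of just that normalization step; the task-specific grouping used by Spatial-R1 is not reproduced.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one group of sampled responses:
    z-score each reward against its own group's statistics. Sketch of
    the core GRPO normalization only."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Responses better than the group average get positive advantages and are reinforced; worse-than-average ones are penalized, all without training a critic.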
☆ UniViTAR: Unified Vision Transformer with Native Resolution
Conventional Vision Transformers simplify visual modeling by standardizing
input resolutions, often disregarding the variability of natural visual data
and compromising spatial-contextual fidelity. While preliminary explorations
have superficially investigated native resolution modeling, existing approaches
still lack systematic analysis from a visual representation perspective. To
bridge this gap, we introduce UniViTAR, a family of homogeneous vision
foundation models tailored for unified visual modalities and native-resolution
scenarios in the multimodal era. Our framework first conducts architectural
upgrades to the vanilla paradigm by integrating multiple advanced components.
Building upon these improvements, a progressive training paradigm is
introduced, which strategically combines two core mechanisms: (1) resolution
curriculum learning, transitioning from fixed-resolution pretraining to native
resolution tuning, thereby leveraging ViT's inherent adaptability to
variable-length sequences, and (2) visual modality adaptation via inter-batch
image-video switching, which balances computational efficiency with enhanced
temporal reasoning. In parallel, a hybrid training framework further synergizes
sigmoid-based contrastive loss with feature distillation from a frozen teacher
model, thereby accelerating early-stage convergence. Finally, despite being
trained exclusively on public datasets, UniViTAR demonstrates its effectiveness
in extensive experiments across model scales from 0.3B to 1B.
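The sigmoid-based contrastive loss mentioned above follows the SigLIP idea of treating every image-text pair as an independent binary decision; a toy sketch, not the paper's implementation (temperature and bias values are illustrative):

```python
import math

def sigmoid_contrastive_loss(sim, temperature=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid loss over an NxN similarity
    matrix: each (image, text) pair is an independent binary
    classification, positive on the diagonal, negative elsewhere."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            logit = temperature * sim[i][j] + bias
            total += math.log1p(math.exp(-z * logit))  # -log sigmoid(z*logit)
    return total / n

loss = sigmoid_contrastive_loss([[0.9, 0.1], [0.2, 0.8]])
```

Unlike a softmax contrastive loss, no normalization over the whole batch is needed, which is what makes the objective friendly to image-video batch switching.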
☆ Memory-efficient Low-latency Remote Photoplethysmography through Temporal-Spatial State Space Duality
Remote photoplethysmography (rPPG), enabling non-contact physiological
monitoring through facial light reflection analysis, faces critical
computational bottlenecks as deep learning introduces performance gains at the
cost of prohibitive resource demands. This paper proposes ME-rPPG, a
memory-efficient algorithm built on temporal-spatial state space duality, which
resolves the trilemma of model scalability, cross-dataset generalization, and
real-time constraints. Leveraging a transferable state space, ME-rPPG
efficiently captures subtle periodic variations across facial frames while
maintaining minimal computational overhead, enabling training on extended video
sequences and supporting low-latency inference. Achieving cross-dataset MAEs of
5.38 (MMPD), 0.70 (VitalVideo), and 0.25 (PURE), ME-rPPG outperforms all
baselines with improvements ranging from 21.3% to 60.2%. Our solution enables
real-time inference with only 3.6 MB memory usage and 9.46 ms latency --
surpassing existing methods by 19.5%-49.7% accuracy and 43.2% user satisfaction
gains in real-world deployments. The code and demos are released for
reproducibility on https://github.com/Health-HCI-Group/ME-rPPG-demo.
☆ Leveraging Embedding Techniques in Multimodal Machine Learning for Mental Illness Assessment
The increasing global prevalence of mental disorders, such as depression and
PTSD, requires objective and scalable diagnostic tools. Traditional clinical
assessments often face limitations in accessibility, objectivity, and
consistency. This paper investigates the potential of multimodal machine
learning to address these challenges, leveraging the complementary information
available in text, audio, and video data. Our approach involves a comprehensive
analysis of various data preprocessing techniques, including novel chunking and
utterance-based formatting strategies. We systematically evaluate a range of
state-of-the-art embedding models for each modality and employ Convolutional
Neural Networks (CNNs) and Bidirectional LSTM Networks (BiLSTMs) for feature
extraction. We explore data-level, feature-level, and decision-level fusion
techniques, including a novel integration of Large Language Model (LLM)
predictions. We also investigate the impact of replacing Multilayer Perceptron
classifiers with Support Vector Machines. We extend our analysis to severity
prediction using PHQ-8 and PCL-C scores and multi-class classification
(considering co-occurring conditions). Our results demonstrate that
utterance-based chunking significantly improves performance, particularly for
text and audio modalities. Decision-level fusion, incorporating LLM
predictions, achieves the highest accuracy, with a balanced accuracy of 94.8%
for depression and 96.2% for PTSD detection. The combination of CNN-BiLSTM
architectures with utterance-level chunking, coupled with the integration of
external LLM, provides a powerful and nuanced approach to the detection and
assessment of mental health conditions. Our findings highlight the potential of
MMML for developing more accurate, accessible, and personalized mental
healthcare tools.
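Decision-level fusion as described (combining per-modality classifiers with an LLM prediction) can be sketched as a weighted probability average; this is a generic illustration, not the paper's exact scheme:

```python
def decision_level_fusion(probs, weights=None):
    """Late fusion: combine per-modality class probabilities
    (e.g. text, audio, video, and an LLM prediction) by a weighted
    average, then pick the argmax class."""
    if weights is None:
        weights = [1.0] * len(probs)
    n_classes = len(probs[0])
    total_w = sum(weights)
    fused = [0.0] * n_classes
    for p, w in zip(probs, weights):
        for c in range(n_classes):
            fused[c] += w * p[c] / total_w
    return fused, max(range(n_classes), key=fused.__getitem__)

# text, audio, video, and LLM classifiers voting on [control, depression]
fused, label = decision_level_fusion(
    [[0.3, 0.7], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]])
```

A strength of fusing at the decision level is that each modality (and the LLM) can use its own best-suited architecture upstream.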
☆ Dual-stream Transformer-GCN Model with Contextualized Representations Learning for Monocular 3D Human Pose Estimation
This paper introduces a novel approach to monocular 3D human pose estimation
using contextualized representation learning with the Transformer-GCN
dual-stream model. Monocular 3D human pose estimation is challenged by depth
ambiguity, limited 3D-labeled training data, imbalanced modeling, and
restricted model generalization. To address these limitations, our work
introduces a groundbreaking motion pre-training method based on contextualized
representation learning. Specifically, our method involves masking 2D pose
features and utilizing a Transformer-GCN dual-stream model to learn
high-dimensional representations through a self-distillation setup. By focusing
on contextualized representation learning and spatial-temporal modeling, our
approach enhances the model's ability to understand spatial-temporal
relationships between postures, resulting in superior generalization.
Furthermore, leveraging the Transformer-GCN dual-stream model, our approach
effectively balances global and local interactions in video pose estimation.
The model adaptively integrates information from both the Transformer and GCN
streams, where the GCN stream effectively learns local relationships between
adjacent key points and frames, while the Transformer stream captures
comprehensive global spatial and temporal features. Our model achieves
state-of-the-art performance on two benchmark datasets, with an MPJPE of 38.0mm
and P-MPJPE of 31.9mm on Human3.6M, and an MPJPE of 15.9mm on MPI-INF-3DHP.
Furthermore, visual experiments on public datasets and in-the-wild videos
demonstrate the robustness and generalization capabilities of our approach.
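For reference, the MPJPE figures quoted above are mean per-joint Euclidean distances; a minimal implementation of the metric (P-MPJPE additionally applies a Procrustes alignment before measuring):

```python
import math

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance
    between predicted and ground-truth 3D joints, in the same
    units as the inputs (millimetres for Human3.6M)."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

# two joints: one exact, one off by a 3-4-5 triangle (5 mm)
err = mpjpe([[0, 0, 0], [10, 0, 0]], [[0, 0, 0], [13, 4, 0]])
```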
☆ Bridge the Gap between SNN and ANN for Image Restoration
Dense prediction models based on traditional Artificial Neural Networks
(ANNs) consume substantial energy, especially for image restoration tasks.
Currently, neural networks based on the Spiking Neural Network (SNN) framework
are beginning to make their mark in the field of image restoration, especially
as they typically use less than 10% of the energy of ANNs with the same
architecture. However, training an SNN is much more expensive than training an
ANN, due to the use of the heuristic gradient descent strategy. In other words,
the process of the SNN's membrane potential signal changing from sparse to
dense is very slow, which hinders the convergence of the whole model. To tackle
this
problem, we propose a novel distillation technique, called asymmetric framework
(ANN-SNN) distillation, in which the teacher is an ANN and the student is an
SNN. Specifically, we leverage the intermediate features (feature maps) learned
by the ANN as hints to guide the training process of the SNN. This approach not
only accelerates the convergence of the SNN but also improves its final
performance, effectively bridging the gap between the efficiency of the SNN and
the superior learning capabilities of ANN. Extensive experimental results show
that our designed SNN-based image restoration model, which has only 1/300 the
number of parameters of the teacher network and 1/50 the energy consumption of
the teacher network, is as good as the teacher network in some denoising tasks.
comment: Under review
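The abstract describes using ANN feature maps as hints for the SNN student; the standard form of such hint distillation is a mean-squared error between intermediate activations, sketched here as a toy (the paper's exact loss may differ):

```python
def hint_loss(student_feats, teacher_feats):
    """Feature-map hint distillation: mean squared error between
    intermediate activations of the SNN student and the frozen
    ANN teacher, averaged over all chosen hint positions."""
    total, count = 0.0, 0
    for s_map, t_map in zip(student_feats, teacher_feats):
        for s, t in zip(s_map, t_map):
            total += (s - t) ** 2
            count += 1
    return total / count

# one hint layer, two activations: student [0, 1] vs teacher [0.5, 1]
loss = hint_loss([[0.0, 1.0]], [[0.5, 1.0]])
```

The hint term is typically added to the task loss, pulling the student's sparse early-training activations toward the teacher's dense ones and thereby speeding convergence.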
☆ Understanding Cross-Model Perceptual Invariances Through Ensemble Metamers
Understanding the perceptual invariances of artificial neural networks is
essential for improving explainability and aligning models with human vision.
Metamers - stimuli that are physically distinct yet produce identical neural
activations - serve as a valuable tool for investigating these invariances. We
introduce a novel approach to metamer generation by leveraging ensembles of
artificial neural networks, capturing shared representational subspaces across
diverse architectures, including convolutional neural networks and vision
transformers. To characterize the properties of the generated metamers, we
employ a suite of image-based metrics that assess factors such as semantic
fidelity and naturalness. Our findings show that convolutional neural networks
generate more recognizable and human-like metamers, while vision transformers
produce realistic but less transferable metamers, highlighting the impact of
architectural biases on representational invariances.
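As a toy illustration of ensemble metamer search, with single linear units standing in for real networks and gradient descent on the activation mismatch: because the ensemble's responses underdetermine the input, the optimizer can land on a stimulus that differs from the reference yet matches all activations, i.e. a metamer.

```python
def ensemble_metamer(weights, x_ref, steps=500, lr=0.05):
    """Toy metamer search: gradient descent on the squared mismatch
    between the ensemble's responses to x and its responses to x_ref.
    Each 'network' here is one linear unit; with fewer networks than
    input dimensions, many distinct inputs share the same responses."""
    targets = [sum(w_i * r_i for w_i, r_i in zip(w, x_ref)) for w in weights]
    x = [0.0] * len(x_ref)
    for _ in range(steps):
        for w, t in zip(weights, targets):  # one descent step per network
            err = sum(w_i * x_i for w_i, x_i in zip(w, x)) - t
            x = [x_i - lr * 2.0 * err * w_i for x_i, w_i in zip(x, w)]
    return x

# two linear probes over a 3-d stimulus; the reference input is [1, 2, 3]
metamer = ensemble_metamer([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
                           [1.0, 2.0, 3.0])
```

Here the result converges to roughly [1, 2.5, 2.5]: distinct from the reference, yet indistinguishable to both probes.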
☆ AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently
witnessed remarkable advancements and are increasingly being deployed in
real-world applications. However, inheriting the sensitivity of visual neural
networks, LVLMs remain vulnerable to adversarial attacks, which can result in
erroneous or malicious outputs. While existing efforts utilize adversarial
fine-tuning to enhance robustness, they often suffer from performance
degradation on clean inputs. In this paper, we propose AdPO, a novel
adversarial defense strategy for LVLMs based on preference optimization. For
the first time, we reframe adversarial training as a preference optimization
problem, aiming to enhance the model's preference for generating normal outputs
on clean inputs while rejecting the potential misleading outputs for
adversarial examples. Notably, AdPO achieves this by solely modifying the image
encoder, e.g., CLIP ViT, resulting in superior clean and adversarial
performance in a variety of downstream tasks. Considering that training involves
large language models (LLMs), the computational cost increases significantly.
We validate that training on smaller LVLMs and subsequently transferring to
larger models can achieve competitive performance while maintaining efficiency
comparable to baseline methods. Our comprehensive experiments confirm the
effectiveness of the proposed AdPO, which provides a novel perspective for
future adversarial defense research.
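AdPO's exact objective is not given in the abstract; preference optimization here presumably builds on a DPO-style loss, with the clean-input output as the preferred completion and the adversarially induced output as the rejected one. A hypothetical sketch:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO preference loss: push the policy's log-prob margin
    between the preferred (normal) output and the rejected (misleading)
    output above the reference model's margin."""
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # -log sigmoid(margin)

# the policy already prefers the clean-input answer more than the reference
loss = dpo_loss(-1.0, -5.0, ref_chosen=-2.0, ref_rejected=-4.0)
```

A positive margin drives the loss below log 2; since AdPO only updates the image encoder, the LLM weights would act as part of the frozen reference.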
☆ FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking
The development of large-scale 3D scene reconstruction and novel view
synthesis methods mostly rely on datasets comprising perspective images with
narrow fields of view (FoV). While effective for small-scale scenes, these
datasets require large image sets and extensive structure-from-motion (SfM)
processing, limiting scalability. To address this, we introduce a fisheye image
dataset tailored for scene reconstruction tasks. Using dual 200-degree fisheye
lenses, our dataset provides full 360-degree coverage of 5 indoor and 5 outdoor
scenes. Each scene has sparse SfM point clouds and precise LIDAR-derived dense
point clouds that can be used as geometric ground-truth, enabling robust
benchmarking under challenging conditions such as occlusions and reflections.
While the baseline experiments focus on vanilla Gaussian Splatting and NeRF
based Nerfacto methods, the dataset supports diverse approaches for scene
reconstruction, novel view synthesis, and image-based rendering.
comment: SCIA 2025
☆ DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
While recent image-based human animation methods achieve realistic body and
facial motion synthesis, critical gaps remain in fine-grained holistic
controllability, multi-scale adaptability, and long-term temporal coherence,
which leads to their lower expressiveness and robustness. We propose a
diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid
guidance to overcome these limitations. For motion guidance, our hybrid control
signals that integrate implicit facial representations, 3D head spheres, and 3D
body skeletons achieve robust control of facial expressions and body movements,
while producing expressive and identity-preserving animations. For scale
adaptation, to handle various body poses and image scales ranging from
portraits to full-body views, we employ a progressive training strategy using
data with varying resolutions and scales. For appearance guidance, we integrate
motion patterns from sequential frames with complementary visual references,
ensuring long-term temporal coherence for unseen regions during complex
movements. Experiments demonstrate that our method outperforms the
state-of-the-art works, delivering expressive results for portraits,
upper-body, and full-body generation with robust long-term consistency. Project
Page: https://grisoon.github.io/DreamActor-M1/.
☆ GSR4B: Biomass Map Super-Resolution with Sentinel-1/2 Guidance
Kaan Karaman, Yuchang Jiang, Damien Robert, Vivien Sainte Fare Garnot, Maria João Santos, Jan Dirk Wegner
Accurate Above-Ground Biomass (AGB) mapping at both large scale and high
spatio-temporal resolution is essential for applications ranging from climate
modeling to biodiversity assessment, and sustainable supply chain monitoring.
At present, fine-grained AGB mapping relies on costly airborne laser scanning
acquisition campaigns usually limited to regional scales. Initiatives such as
the ESA CCI map attempt to generate global biomass products from diverse
spaceborne sensors but at a coarser resolution. To enable global,
high-resolution (HR) mapping, several works propose to regress AGB from HR
satellite observations such as ESA Sentinel-1/2 images. We propose a novel way
to address HR AGB estimation, by leveraging both HR satellite observations and
existing low-resolution (LR) biomass products. We cast this problem as Guided
Super-Resolution (GSR), aiming at upsampling LR biomass maps (sources) from
$100$ to $10$ m resolution, using auxiliary HR co-registered satellite images
(guides). We compare super-resolving AGB maps with and without guidance,
against direct regression from satellite images, on the public BioMassters
dataset. We observe that Multi-Scale Guidance (MSG) outperforms direct
regression both for regression ($-780$ t/ha RMSE) and perception ($+2.0$ dB
PSNR) metrics, and better captures high-biomass values, without significant
computational overhead. Interestingly, unlike the RGB+Depth setting they were
originally designed for, our best-performing AGB GSR approaches are those that
most preserve the guide image texture. Our results make a strong case for
adopting the GSR framework for accurate HR biomass mapping at scale. Our code
and model weights are made publicly available
(https://github.com/kaankaramanofficial/GSR4B).
comment: Accepted for an oral presentation at the ISPRS Geospatial Week 2025
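The paper's MSG model is learned, but the idea of guided super-resolution can be illustrated with a classic joint-bilateral upsampling of a coarse biomass map by an HR guide band (a 1-D toy; all parameters are illustrative):

```python
import math

def guided_upsample_1d(source_lr, guide_hr, scale, sigma_s=1.0, sigma_r=0.1):
    """Joint-bilateral-style guided upsampling (1-D sketch): each
    high-res output value is a weighted average of coarse source
    samples, weighted by spatial proximity *and* by similarity of the
    high-res guide (e.g. a Sentinel band) to the guide value at each
    coarse sample's location, so edges in the guide are preserved."""
    guide_lr = [guide_hr[i * scale] for i in range(len(source_lr))]
    out = []
    for i, g in enumerate(guide_hr):
        num = den = 0.0
        for j, (s, gl) in enumerate(zip(source_lr, guide_lr)):
            w = math.exp(-((i / scale - j) ** 2) / (2 * sigma_s ** 2)
                         - ((g - gl) ** 2) / (2 * sigma_r ** 2))
            num += w * s
            den += w
        out.append(num / den)
    return out

# a 2-pixel coarse biomass map, upsampled 3x along a sharp guide edge
hr = guided_upsample_1d([10.0, 50.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0], scale=3)
```

The sharp transition in the guide keeps low- and high-biomass values from bleeding into each other, which is the texture-preserving behavior the paper observes in its best GSR variants.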
☆ InvFussion: Bridging Supervised and Zero-shot Diffusion for Inverse Problems
Diffusion Models have demonstrated remarkable capabilities in handling
inverse problems, offering high-quality posterior-sampling-based solutions.
Despite significant advances, a fundamental trade-off persists, regarding the
way the conditioned synthesis is employed: Training-based methods achieve high
quality results, while zero-shot approaches trade this with flexibility. This
work introduces a framework that combines the best of both worlds -- the strong
performance of supervised approaches and the flexibility of zero-shot methods.
This is achieved through a novel architectural design that seamlessly
integrates the degradation operator directly into the denoiser. In each block,
our proposed architecture applies the degradation operator on the network
activations and conditions the output using the attention mechanism, enabling
adaptation to diverse degradation scenarios while maintaining high performance.
Our work demonstrates the versatility of the proposed architecture, operating
as a general MMSE estimator, a posterior sampler, or a Neural Posterior
Principal Component estimator. This flexibility enables a wide range of
downstream tasks, highlighting the broad applicability of our framework. The
proposed modification of the denoiser network offers a versatile, accurate, and
computationally efficient solution, demonstrating the advantages of dedicated
network architectures for complex inverse problems. Experimental results on the
FFHQ and ImageNet datasets demonstrate state-of-the-art posterior-sampling
performance, surpassing both training-based and zero-shot alternatives.
☆ Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation
3D point cloud semantic segmentation (PCSS) is a cornerstone for
environmental perception in robotic systems and autonomous driving, enabling
precise scene understanding through point-wise classification. While
unsupervised domain adaptation (UDA) mitigates label scarcity in PCSS, existing
methods critically overlook the inherent vulnerability to real-world
perturbations (e.g., snow, fog, rain) and adversarial distortions. This work
first identifies two intrinsic limitations that undermine current PCSS-UDA
robustness: (a) unsupervised features overlap from unaligned boundaries in
shared-class regions and (b) feature structure erosion caused by
domain-invariant learning that suppresses target-specific patterns. To address
the proposed problems, we propose a tripartite framework consisting of: 1) a
robustness evaluation model quantifying resilience against adversarial
attack/corruption types through robustness metrics; 2) an invertible attention
alignment module (IAAM) enabling bidirectional domain mapping while preserving
discriminative structure via attention-guided overlap suppression; and 3) a
contrastive memory bank with quality-aware contrastive learning that
progressively refines pseudo-labels with feature quality for more
discriminative representations. Extensive experiments on
SynLiDAR-to-SemanticPOSS adaptation demonstrate a maximum mIoU improvement of
14.3% under adversarial attack.
comment: 8 pages, 6 figures
☆ CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition
Continuous sign language recognition (CSLR) focuses on interpreting and
transcribing sequences of sign language gestures in videos. In this work, we
propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that
adapts the powerful pre-trained visual encoder of the CLIP model to sign
language tasks through parameter-efficient fine-tuning (PEFT). We introduce two
variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP
visual encoder, enabling fine-tuning with minimal trainable parameters. The
effectiveness of the proposed frameworks is validated on four datasets:
Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA
variants outperformed several SOTA models with fewer trainable parameters.
Extensive ablation studies emphasize the effectiveness and flexibility of the
proposed methods with different vision-language models for CSLR. These findings
showcase the potential of adapting large-scale pre-trained models for scalable
and efficient CSLR, which pave the way for future advancements in sign language
understanding.
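The SLA-LoRA variant injects low-rank adapters into the frozen CLIP encoder; the basic LoRA forward pass can be sketched as follows (pure-Python toy, not the paper's code):

```python
def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA forward pass: the frozen weight W is augmented by a
    low-rank update (alpha/r) * B @ A, so only A (r x d_in) and
    B (d_out x r) are trained during fine-tuning."""
    r = len(A)  # rank = number of rows of A

    def matvec(M, v):
        return [sum(m_i * v_i for m_i, v_i in zip(row, v)) for row in M]

    base = matvec(W, x)              # frozen path
    low = matvec(B, matvec(A, x))    # trainable low-rank path
    return [b_i + (alpha / r) * l_i for b_i, l_i in zip(base, low)]

# rank-1 update on a 2x2 frozen identity weight
y = lora_forward([1.0, 0.0],
                 W=[[1.0, 0.0], [0.0, 1.0]],
                 A=[[1.0, 1.0]],
                 B=[[0.5], [0.0]])
```

The trainable parameter count is r*(d_in + d_out) per layer, which is what lets both CLIP-SLA variants beat larger fully fine-tuned baselines on parameter efficiency.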
☆ BioAtt: Anatomical Prior Driven Low-Dose CT Denoising
Deep-learning-based denoising methods have significantly improved Low-Dose CT
(LDCT) image quality. However, existing models often over-smooth important
anatomical details due to their purely data-driven attention mechanisms. To
address this challenge, we propose a novel LDCT denoising framework, BioAtt.
The key innovation lies in attending to anatomical prior distributions extracted
from the pretrained vision-language model BiomedCLIP. These priors guide the
denoising model to focus on anatomically relevant regions to suppress noise
while preserving clinically relevant structures. We highlight three main
contributions: BioAtt outperforms baseline and attention-based models in SSIM,
PSNR, and RMSE across multiple anatomical regions. The framework introduces a
new architectural paradigm by embedding anatomic priors directly into spatial
attention. Finally, BioAtt attention maps provide visual confirmation that the
improvements stem from anatomical guidance rather than increased model
complexity.
comment: 14 pages
☆ Robust Unsupervised Domain Adaptation for 3D Point Cloud Segmentation Under Source Adversarial Attacks
Unsupervised domain adaptation (UDA) frameworks have shown good
generalization capabilities for 3D point cloud semantic segmentation models on
clean data. However, existing works overlook adversarial robustness when the
source domain itself is compromised. To comprehensively explore the robustness
of the UDA frameworks, we first design a stealthy adversarial point cloud
generation attack that can significantly contaminate datasets with only minor
perturbations to the point cloud surface. Based on that, we propose a novel
dataset, AdvSynLiDAR, comprising synthesized contaminated LiDAR point clouds.
With the generated corrupted data, we further develop the Adversarial
Adaptation Framework (AAF) as the countermeasure. Specifically, by extending
the key point sensitive (KPS) loss towards the Robust Long-Tail loss (RLT loss)
and utilizing a decoder branch, our approach enables the model to focus on
long-tail classes during the pre-training phase and leverages high-confidence
decoded point cloud information to restore point cloud structures during the
adaptation phase. We evaluated our AAF method on the AdvSynLiDAR dataset, where
the results demonstrate that our AAF method can mitigate performance
degradation under source adversarial perturbations for UDA in the 3D point
cloud segmentation application.
☆ Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning
The rapid advancement of Large Multi-modal Foundation Models (LMM) has paved
the way for the possible Explainable Image Quality Assessment (EIQA) with
instruction tuning from two perspectives: overall quality explanation, and
attribute-wise perception answering. However, existing works usually overlooked
the conflicts between these two types of perception explanations during joint
instruction tuning, leading to insufficient perception understanding. To
mitigate this, we propose a new paradigm for perception-oriented instruction
tuning, i.e., Q-Adapt, which aims to eliminate the conflicts and achieve the
synergy between these two EIQA tasks when adapting LMM, resulting in enhanced
multi-faceted explanations of IQA. Particularly, we propose a progressive
instruction tuning strategy by dividing the adaption process of LMM for EIQA
into two stages, where the first stage empowers the LMM with universal
perception knowledge tailored for two tasks using an efficient transfer
learning strategy, i.e., LoRA, and the second stage introduces the
instruction-adaptive visual prompt tuning to dynamically adapt visual features
for the different instructions from two tasks. In this way, our proposed
Q-Adapt can achieve a lightweight visual quality evaluator, demonstrating
comparable performance and, in some instances, superior results across
perceptual-related benchmarks and commonly-used IQA databases. The source code
is publicly available at https://github.com/yeppp27/Q-Adapt.
☆ ProtoGuard-guided PROPEL: Class-Aware Prototype Enhancement and Progressive Labeling for Incremental 3D Point Cloud Segmentation
3D point cloud semantic segmentation technology has been widely used.
However, in real-world scenarios, the environment is evolving. Thus,
offline-trained segmentation models may suffer catastrophic forgetting of
previously seen classes. Class-incremental learning (CIL) is designed to
address the problem of catastrophic forgetting. In point cloud data, we
observe high similarity and unclear boundaries between different classes.
Meanwhile, they are known to be imbalanced in class distribution. These lead to
issues including misclassification between similar classes and the long-tail
problem, which have not been adequately addressed in previous CIL methods. We
thus propose ProtoGuard and PROPEL (Progressive Refinement Of PsEudo-Labels).
In the base-class training phase, ProtoGuard maintains geometric and semantic
prototypes for each class, which are combined into prototype features using an
attention mechanism. In the novel-class training phase, PROPEL inherits the
base feature extractor and classifier, guiding pseudo-label propagation and
updates based on density distribution and semantic similarity. Extensive
experiments show that our approach achieves remarkable results on both the
S3DIS and ScanNet datasets, improving the mIoU of 3D point cloud segmentation
by a maximum of 20.39% under the 5-step CIL scenario on S3DIS.
☆ FlowR: Flowing from Sparse to Dense 3D Reconstructions
Tobias Fischer, Samuel Rota Bulò, Yung-Hsu Yang, Nikhil Varma Keetha, Lorenzo Porzi, Norman Müller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, Peter Kontschieder
3D Gaussian splatting enables high-quality novel view synthesis (NVS) at
real-time frame rates. However, its quality drops sharply as we depart from the
training views. Thus, dense captures are needed to match the high-quality
expectations of some applications, e.g. Virtual Reality (VR). However, such
dense captures are very laborious and expensive to obtain. Existing works have
explored using 2D generative models to alleviate this requirement by
distillation or generating additional training views. These methods are often
conditioned only on a handful of reference input views and thus do not fully
exploit the available 3D information, leading to inconsistent generation
results and reconstruction artifacts. To tackle this problem, we propose a
multi-view, flow matching model that learns a flow to connect novel view
renderings from possibly sparse reconstructions to renderings that we expect
from dense reconstructions. This enables augmenting scene captures with novel,
generated views to improve reconstruction quality. Our model is trained on a
novel dataset of 3.6M image pairs and can process up to 45 views at 540x960
resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline
consistently improves NVS in sparse- and dense-view scenarios, leading to
higher-quality reconstructions than prior works across multiple, widely-used
NVS benchmarks.
comment: Project page is available at https://tobiasfshr.github.io/pub/flowr
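The flow matching recipe behind FlowR can be illustrated with its training target: interpolate between a source rendering and a target rendering, then regress the model onto the constant velocity connecting them (a generic conditional-flow-matching sketch, not the paper's implementation):

```python
def flow_matching_pair(x0, x1, t):
    """Conditional flow matching sample: interpolate between a
    'sparse-reconstruction' rendering x0 and a 'dense-reconstruction'
    rendering x1, returning the point x_t on the straight path and
    the target velocity (x1 - x0) the network is regressed onto."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

# midpoint of the path between a degraded and a clean 2-pixel "render"
x_t, v = flow_matching_pair([0.0, 0.0], [1.0, 2.0], t=0.5)
```

At inference the learned velocity field is integrated from a sparse-reconstruction rendering toward a dense-quality one, which is how the generated views augment the capture.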
☆ Bridge 2D-3D: Uncertainty-aware Hierarchical Registration Network with Domain Alignment AAAI2025
Image-to-point cloud registration methods typically determine the rigid
transformation using a coarse-to-fine pipeline. However, directly and
uniformly matching image patches with point cloud patches may lead to focusing
on incorrect noise patches during matching while ignoring key ones. Moreover,
due to the significant differences between image and point cloud modalities, it
may be challenging to bridge the domain gap without specific improvements in
design. To address the above issues, we innovatively propose the
Uncertainty-aware Hierarchical Matching Module (UHMM) and the Adversarial Modal
Alignment Module (AMAM). Within the UHMM, we model the uncertainty of critical
information in image patches and facilitate multi-level fusion interactions
between image and point cloud features. In the AMAM, we design an adversarial
approach to reduce the domain gap between image and point cloud. Extensive
experiments and ablation studies on RGB-D Scene V2 and 7-Scenes benchmarks
demonstrate the superiority of our method, making it a state-of-the-art
approach for image-to-point cloud registration tasks.
comment: Accepted to AAAI 2025
☆ Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions
The robustness of DNNs is a crucial factor in safety-critical applications,
particularly in complex and dynamic environments where localized corruptions
can arise. While previous studies have evaluated the robustness of semantic
segmentation (SS) models under whole-image natural or adversarial corruptions,
a comprehensive investigation into the spatial robustness of dense vision
models under localized corruptions remained underexplored. This paper fills
this gap by introducing specialized metrics for benchmarking the spatial
robustness of segmentation models, alongside an evaluation framework to
assess the impact of localized corruptions. Furthermore, we uncover the
inherent complexity of characterizing worst-case robustness using a single
localized adversarial perturbation. To address this, we propose region-aware
multi-attack adversarial analysis, a method that enables a deeper understanding
of model robustness against adversarial perturbations applied to specific
regions. The proposed metrics and analysis were evaluated on 15 segmentation
models in driving scenarios, uncovering key insights into the effects of
localized corruption in both natural and adversarial forms. The results reveal
that models respond to these two types of threats differently; for instance,
transformer-based segmentation models demonstrate notable robustness to
localized natural corruptions but are highly vulnerable to adversarial ones,
while CNN-based models show the opposite. Consequently, we also address the
challenge of
balancing robustness to both natural and adversarial localized corruptions by
means of ensemble models, thereby achieving a broader threat coverage and
improved reliability for dense vision tasks.
comment: Under review
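The benchmarking idea of localized corruptions, perturbing only a region while leaving the rest of the image clean, can be sketched as (illustrative helper, not the paper's framework):

```python
import random

def apply_localized_corruption(image, top, left, size, corrupt, seed=0):
    """Apply a corruption function only inside a square region of a
    2-D image (values in [0, 1]), leaving everything else untouched,
    so model robustness can be measured per-region."""
    rng = random.Random(seed)
    out = [row[:] for row in image]
    for i in range(top, min(top + size, len(image))):
        for j in range(left, min(left + size, len(image[0]))):
            out[i][j] = corrupt(out[i][j], rng)
    return out

def gaussian_noise(px, rng):
    return min(1.0, max(0.0, px + rng.gauss(0.0, 0.2)))

img = [[0.5] * 4 for _ in range(4)]
corrupted = apply_localized_corruption(img, top=1, left=1, size=2,
                                       corrupt=gaussian_noise)
```

Sweeping the region over the image (or choosing it adversarially, as in the proposed region-aware multi-attack analysis) yields a spatial robustness map rather than a single whole-image score.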
☆ A Conic Transformation Approach for Solving the Perspective-Three-Point Problem
We propose a conic transformation method to solve the Perspective-Three-Point
(P3P) problem. In contrast to the current state-of-the-art solvers, which
formulate the P3P problem by intersecting two conics and constructing a
degenerate conic to find the intersection, our approach builds upon a new
formulation based on a transformation that maps the two conics to a new
coordinate system, where one of the conics becomes a standard parabola in a
canonical form. This enables expressing one variable in terms of the other
variable, and as a consequence, substantially simplifies the problem of finding
the conic intersection. Moreover, the polynomial coefficients are fast to
compute, and we only need to determine the real-valued intersection points,
which avoids the requirement of using computationally expensive complex
arithmetic. While the current state-of-the-art methods reduce the conic
intersection problem to solving a univariate cubic equation, our approach,
despite resulting in a quartic equation, is still faster thanks to this new
simplified formulation. Extensive evaluations demonstrate that our method
achieves higher speed while maintaining robustness and stability comparable to
state-of-the-art methods.
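The claim about avoiding complex arithmetic rests on only needing the real roots of the resulting quartic; one simple way to recover real roots without complex arithmetic is a sign-change scan plus bisection (a generic sketch, far slower than the closed-form solvers the paper benchmarks):

```python
def real_roots(coeffs, lo=-10.0, hi=10.0, n=4000, tol=1e-10):
    """Real roots of a polynomial (coefficients highest-degree first):
    scan a grid for sign changes and refine each bracketed simple
    root by bisection -- no complex arithmetic required."""
    def p(x):
        y = 0.0
        for c in coeffs:         # Horner evaluation
            y = y * x + c
        return y

    roots, step, a = [], (hi - lo) / n, lo
    for _ in range(n):
        b = a + step
        fa, fb = p(a), p(b)
        if fa == 0.0:
            roots.append(a)
        elif fa * fb < 0.0:
            l, r = a, b
            while r - l > tol:
                m = 0.5 * (l + r)
                if p(l) * p(m) <= 0.0:
                    r = m
                else:
                    l = m
            roots.append(0.5 * (l + r))
        a = b
    return roots

# quartic (x^2 - 1)(x^2 - 4): real roots at -2, -1, 1, 2
rts = real_roots([1.0, 0.0, -5.0, 0.0, 4.0])
```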
☆ 3DBonsai: Structure-Aware Bonsai Modeling Using Conditioned 3D Gaussian Splatting ICME 2025
Recent advancements in text-to-3D generation have shown remarkable results by
leveraging 3D priors in combination with 2D diffusion. However, previous
methods utilize 3D priors that lack detailed and complex structural
information, limiting them to generating simple objects and presenting
challenges for creating intricate structures such as bonsai. In this paper, we
propose 3DBonsai, a novel text-to-3D framework for generating 3D bonsai with
complex structures. Technically, we first design a trainable 3D space
colonization algorithm to produce bonsai structures, which are then enhanced
through random sampling and point cloud augmentation to serve as the 3D
Gaussian priors. We introduce two bonsai generation pipelines with distinct
structural levels: fine structure conditioned generation, which initializes 3D
Gaussians using a 3D structure prior to produce detailed and complex bonsai,
and coarse structure conditioned generation, which employs a multi-view
structure consistency module to align 2D and 3D structures. Moreover, we have
compiled a unified 2D and 3D Chinese-style bonsai dataset. Our experimental
results demonstrate that 3DBonsai significantly outperforms existing methods,
providing a new benchmark for structure-aware 3D bonsai generation.
comment: Accepted by ICME 2025
☆ A$^\text{T}$A: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting CVPR 2025
Yizhe Tang, Zhimin Sun, Yuzhen Du, Ran Yi, Guangben Lu, Teng Hu, Luying Li, Lizhuang Ma, Fangyuan Zou
Image inpainting aims to fill the missing region of an image. Recently, there
has been a surge of interest in foreground-conditioned background inpainting, a
sub-task that fills the background of an image while the foreground subject and
associated text prompt are provided. Existing background inpainting methods
typically strictly preserve the subject's original position from the source
image, resulting in inconsistencies between the subject and the generated
background. To address this challenge, we propose a new task, the "Text-Guided
Subject-Position Variable Background Inpainting", which aims to dynamically
adjust the subject position to achieve a harmonious relationship between the
subject and the inpainted background, and propose the Adaptive Transformation
Agent (A$^\text{T}$A) for this task. Firstly, we design a PosAgent Block that
adaptively predicts an appropriate displacement based on given features to
achieve variable subject-position. Secondly, we design the Reverse Displacement
Transform (RDT) module, which arranges multiple PosAgent blocks in a reverse
structure, to transform hierarchical feature maps from deep to shallow based on
semantic information. Thirdly, we equip A$^\text{T}$A with a Position Switch
Embedding to control whether the subject's position in the generated image is
adaptively predicted or fixed. Extensive comparative experiments validate the
effectiveness of our A$^\text{T}$A approach, which not only demonstrates
superior inpainting capabilities in subject-position variable inpainting, but
also ensures good performance on subject-position fixed inpainting.
comment: Accepted by CVPR 2025
☆ A topology-preserving three-stage framework for fully-connected coronary artery extraction
Coronary artery extraction is a crucial prerequisite for computer-aided
diagnosis of coronary artery disease. Accurately extracting the complete
coronary tree remains challenging due to several factors, including the presence of
thin distal vessels, tortuous topological structures, and insufficient
contrast. These issues often result in over-segmentation and under-segmentation
in current segmentation methods. To address these challenges, we propose a
topology-preserving three-stage framework for fully-connected coronary artery
extraction. This framework includes vessel segmentation, centerline
reconnection, and missing vessel reconstruction. First, we introduce a new
centerline enhanced loss in the segmentation process. Second, for the broken
vessel segments, we further propose a regularized walk algorithm to integrate
distance, probabilities predicted by a centerline classifier, and directional
cosine similarity, for reconnecting the centerlines. Third, we apply implicit
neural representation and implicit modeling, to reconstruct the geometric model
of the missing vessels. Experimental results show that our proposed framework
outperforms existing methods, achieving Dice scores of 88.53\% and 85.07\%,
with Hausdorff Distances (HD) of 1.07mm and 1.63mm on the ASOCA and PDSCA datasets,
respectively. Code will be available at https://github.com/YH-Qiu/CorSegRec.
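The regularized walk described above combines three cues into a single score for each candidate centerline point. A minimal sketch of one such scoring function follows; the additive combination and the weights `alpha`, `beta`, `gamma` are illustrative assumptions, not the paper's exact formula:

```python
import numpy as np

def walk_score(cur, prev, cand, prob, alpha=1.0, beta=1.0, gamma=1.0):
    """Score a candidate point for reconnecting a broken centerline by
    combining (i) inverse distance to the candidate, (ii) the centerline
    classifier's probability for the candidate, and (iii) directional
    cosine similarity with the current walking direction."""
    d = np.linalg.norm(cand - cur)
    v_prev = cur - prev          # current walking direction
    v_next = cand - cur          # proposed step direction
    cos_sim = np.dot(v_prev, v_next) / (
        np.linalg.norm(v_prev) * np.linalg.norm(v_next) + 1e-8)
    return alpha / (d + 1e-8) + beta * prob + gamma * cos_sim
```

At each step the walk would evaluate `walk_score` for all nearby candidates and move to the highest-scoring one, which discourages reversals and jumps to distant, low-probability points.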
☆ DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image
Depth enhancement, which uses RGB images as guidance to convert raw signals
from dToF into high-precision, dense depth maps, is a critical task in computer
vision. Although existing super-resolution-based methods show promising results
on public datasets, they often rely on idealized assumptions like accurate
region correspondences and reliable dToF inputs, overlooking calibration errors
that cause misalignment and anomaly signals inherent to dToF imaging, limiting
real-world applicability. To address these challenges, we propose a novel
completion-based method, named DEPTHOR, featuring advances in both the training
strategy and model architecture. First, we propose a method to simulate
real-world dToF data from the accurate ground truth in synthetic datasets to
enable noise-robust training. Second, we design a novel network that
incorporates monocular depth estimation (MDE), leveraging global depth
relationships and contextual information to improve prediction in challenging
regions. On the ZJU-L5 dataset, our training strategy significantly enhances
depth completion models, achieving results comparable to depth super-resolution
methods, while our model achieves state-of-the-art results, improving Rel and
RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we
collected, our method outperforms SOTA methods on preliminary stereo-based GT,
improving Rel and RMSE by 23% and 22%, respectively. Our code is available at
https://github.com/ShadowBbBb/Depthor
comment: 10 pages, 8 figures, 7 tables
☆ Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
Video retrieval requires aligning visual content with corresponding natural
language descriptions. In this paper, we introduce Modality Auxiliary Concepts
for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific
tags -- automatically extracted from foundation models -- to enhance video
retrieval. We propose to align modalities in a latent space, along with
learning and aligning auxiliary latent concepts, derived from the features of a
video and its corresponding caption. We introduce these auxiliary concepts to
improve the alignment of visual and textual latent concepts, and so are able to
distinguish concepts from one another. We conduct extensive experiments on five
diverse datasets: MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The
experimental results consistently demonstrate that modality-specific tags
improve cross-modal alignment, outperforming current state-of-the-art methods
across three datasets and performing comparably or better across the other two.
☆ Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models
Vision-language models (VLMs) have advanced rapidly in processing multimodal
information, but their ability to reconcile conflicting signals across
modalities remains underexplored. This work investigates how VLMs process ASCII
art, a unique medium where textual elements collectively form visual patterns,
potentially creating semantic-visual conflicts. We introduce a novel evaluation
framework that systematically challenges five state-of-the-art models
(including GPT-4o, Claude, and Gemini) using adversarial ASCII art, where
character-level semantics deliberately contradict global visual patterns. Our
experiments reveal a strong text-priority bias: VLMs consistently prioritize
textual information over visual patterns, with visual recognition ability
declining dramatically as semantic complexity increases. Various mitigation
attempts through visual parameter tuning and prompt engineering yielded only
modest improvements, suggesting that this limitation requires
architectural-level solutions. These findings uncover fundamental flaws in how
current VLMs integrate multimodal information, providing important guidance for
future model development while highlighting significant implications for
content moderation systems vulnerable to adversarial examples.
comment: Under review at COLM 2025
☆ Instance Migration Diffusion for Nuclear Instance Segmentation in Pathology
Nuclear instance segmentation plays a vital role in disease diagnosis within
digital pathology. However, limited labeled data in pathological images
restricts the overall performance of nuclear instance segmentation. To tackle
this challenge, we propose a novel data augmentation framework, the Instance
Migration Diffusion Model (IM-Diffusion), designed to generate
more varied pathological images by constructing diverse nuclear layouts and
internuclear spatial relationships. In detail, we introduce a Nuclear Migration
Module (NMM) which constructs diverse nuclear layouts by simulating the process
of nuclear migration. Building on this, we further present an
Internuclear-regions Inpainting Module (IIM) to generate diverse internuclear
spatial relationships by structure-aware inpainting. On the basis of the above,
IM-Diffusion generates more diverse pathological images with different layouts
and internuclear spatial relationships, thereby facilitating downstream tasks.
Evaluations on the CoNSeP and GLySAC datasets demonstrate that the images
generated by IM-Diffusion effectively enhance overall instance segmentation
performance. Code will be made public later.
☆ Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation
We present Pro-DG, a framework for procedurally controllable photo-realistic
facade generation that combines a procedural shape grammar with diffusion-based
image synthesis. Starting from a single input image, we reconstruct its facade
layout using grammar rules, then edit that structure through user-defined
transformations. As facades are inherently multi-hierarchical structures, we
introduce a hierarchical matching procedure that aligns facade structures at
different levels; this alignment is then used to derive control maps that guide a generative
diffusion pipeline. This approach retains local appearance fidelity while
accommodating large-scale edits such as floor duplication or window
rearrangement. We provide a thorough evaluation, comparing Pro-DG against
inpainting-based baselines and synthetic ground truths. Our user study and
quantitative measurements indicate improved preservation of architectural
identity and higher edit accuracy. Our method is the first to integrate
neuro-symbolically derived shape grammars with modern generative
models, and it highlights the broader potential of such approaches for precise and
controllable image manipulation.
comment: 12 pages, 13 figures
☆ STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation
Accurate segmentation of lesions plays a critical role in medical image
analysis and diagnosis. Traditional segmentation approaches that rely solely on
visual features often struggle with the inherent uncertainty in lesion
distribution and size. To address these issues, we propose STPNet, a
Scale-aware Text Prompt Network that leverages vision-language modeling to
enhance medical image segmentation. Our approach utilizes multi-scale textual
descriptions to guide lesion localization and employs retrieval-segmentation
joint learning to bridge the semantic gap between visual and linguistic
modalities. Crucially, STPNet retrieves relevant textual information from a
specialized medical text repository during training, eliminating the need for
text input during inference while retaining the benefits of cross-modal
learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and
Kvasir-SEG. Experimental results show that our vision-language approach
outperforms state-of-the-art segmentation methods, demonstrating the
effectiveness of incorporating textual semantic knowledge into medical image
analysis. The code is publicly available at
https://github.com/HUANGLIZI/STPNet.
☆ RealityAvatar: Towards Realistic Loose Clothing Modeling in Animatable 3D Gaussian Avatars
Modeling animatable human avatars from monocular or multi-view videos has
been widely studied, with recent approaches leveraging neural radiance fields
(NeRFs) or 3D Gaussian Splatting (3DGS) achieving impressive results in
novel-view and novel-pose synthesis. However, existing methods often struggle
to accurately capture the dynamics of loose clothing, as they primarily rely on
global pose conditioning or static per-frame representations, leading to
oversmoothing and temporal inconsistencies in non-rigid regions. To address
this, we propose RealityAvatar, an efficient framework for high-fidelity
digital human modeling, specifically targeting loosely dressed avatars. Our
method leverages 3D Gaussian Splatting to capture complex clothing deformations
and motion dynamics while ensuring geometric consistency. By incorporating a
motion trend module and a latent bone encoder, we explicitly model
pose-dependent deformations and temporal variations in clothing behavior.
Extensive experiments on benchmark datasets demonstrate the effectiveness of
our approach in capturing fine-grained clothing deformations and motion-driven
shape variations. Our method significantly enhances structural fidelity and
perceptual quality in dynamic human reconstruction, particularly in non-rigid
regions, while achieving better consistency across temporal frames.
☆ Semi-Supervised Biomedical Image Segmentation via Diffusion Models and Teacher-Student Co-Training
Supervised deep learning for semantic segmentation has achieved excellent
results in accurately identifying anatomical and pathological structures in
medical images. However, it often requires large annotated training datasets,
which limits its scalability in clinical settings. To address this challenge,
semi-supervised learning is a well-established approach that leverages both
labeled and unlabeled data. In this paper, we introduce a novel semi-supervised
teacher-student framework for biomedical image segmentation, inspired by the
recent success of generative models. Our approach leverages denoising diffusion
probabilistic models (DDPMs) to generate segmentation masks by progressively
refining noisy inputs conditioned on the corresponding images. The teacher
model is first trained in an unsupervised manner using a cycle-consistency
constraint based on noise-corrupted image reconstruction, enabling it to
generate informative semantic masks. Subsequently, the teacher is integrated
into a co-training process with a twin-student network. The student learns from
ground-truth labels when available and from teacher-generated pseudo-labels
otherwise, while the teacher continuously improves its pseudo-labeling
capabilities. Finally, to further enhance performance, we introduce a
multi-round pseudo-label generation strategy that iteratively improves the
pseudo-labeling process. We evaluate our approach on multiple biomedical
imaging benchmarks, spanning multiple imaging modalities and segmentation
tasks. Experimental results show that our method consistently outperforms
state-of-the-art semi-supervised techniques, highlighting its effectiveness in
scenarios with limited annotated data. The code to replicate our experiments
can be found at
https://github.com/ciampluca/diffusion_semi_supervised_biomedical_image_segmentation
☆ Beyond Nearest Neighbor Interpolation in Data Augmentation
Using nearest neighbor interpolation to avoid the risk of undefined categorical
labels overlooks a second risk: exacerbating pixel-level annotation errors
during data augmentation. To avoid both risks simultaneously, the author
modified the data transformation functions of convolutional neural networks,
incorporating a modified geometric transformation function that improves the
quality of augmented data by removing the reliance on nearest neighbor
interpolation, and integrating a mean-based class filtering mechanism to handle
the undefined categorical labels produced by alternative interpolation
algorithms. Experiments on semantic segmentation tasks using three medical
image datasets demonstrated both qualitative and quantitative improvements with
the alternative interpolation algorithms.
comment: 6 pages, 9 figures, 1 table
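The core idea above, replacing nearest neighbor interpolation for label maps while still resolving fractional class values, can be approximated by interpolating the one-hot encoding of the labels and taking a per-pixel argmax. This is a simplified stand-in for the paper's mean-based class filtering, not its actual implementation:

```python
import numpy as np

def upsample_labels_onehot(labels, num_classes, scale=2):
    """Upsample a categorical label map without nearest-neighbor
    interpolation: bilinearly interpolate each class channel of the
    one-hot encoding, then resolve the fractional (otherwise undefined)
    values with a per-pixel argmax over class scores."""
    h, w = labels.shape
    onehot = np.eye(num_classes)[labels]                 # (h, w, C)
    H, W = h * scale, w * scale
    ys = (np.arange(H) + 0.5) / scale - 0.5              # source coordinates
    xs = (np.arange(W) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]
    wx = np.clip(xs - x0, 0, 1)[None, :, None]
    top = onehot[y0][:, x0] * (1 - wx) + onehot[y0][:, x1] * wx
    bot = onehot[y1][:, x0] * (1 - wx) + onehot[y1][:, x1] * wx
    probs = top * (1 - wy) + bot * wy                    # (H, W, C) soft labels
    return probs.argmax(axis=2)                          # filter back to classes
```

Unlike nearest neighbor, this blends neighboring annotations, so an isolated mislabeled pixel is outvoted by its neighbors instead of being replicated.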
☆ Domain Guidance: A Simple Transfer Approach for a Pre-trained Diffusion Model
Recent advancements in diffusion models have revolutionized generative
modeling. However, the impressive and vivid outputs they produce often come at
the cost of significant model scaling and increased computational demands.
Consequently, building personalized diffusion models based on off-the-shelf
models has emerged as an appealing alternative. In this paper, we introduce a
novel perspective on conditional generation for transferring a pre-trained
model. From this viewpoint, we propose *Domain Guidance*, a straightforward
transfer approach that leverages pre-trained knowledge to guide the sampling
process toward the target domain. Domain Guidance shares a formulation similar
to advanced classifier-free guidance, facilitating better domain alignment and
higher-quality generations. We provide both empirical and theoretical analyses
of the mechanisms behind Domain Guidance. Our experimental results demonstrate
its substantial effectiveness across various transfer benchmarks, achieving
over a 19.6% improvement in FID and a 23.4% improvement in FD$_\text{DINOv2}$
compared to standard fine-tuning. Notably, existing fine-tuned models can
seamlessly integrate Domain Guidance to leverage these benefits, without
additional training.
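Since Domain Guidance shares its formulation with classifier-free guidance, a sampling step can be sketched as an extrapolation between the two models' noise predictions. The role assignment and the weight `w` below are assumptions based on that CFG analogy, not the paper's verbatim equations:

```python
import numpy as np

def domain_guided_eps(eps_ft, eps_pre, w=2.0):
    """Classifier-free-guidance-style combination: the pre-trained model
    plays the role of the 'unconditional' branch and the fine-tuned
    target-domain model the 'conditional' branch. w = 1 recovers plain
    fine-tuned sampling; w > 1 pushes samples toward the target domain."""
    return eps_pre + w * (eps_ft - eps_pre)
```

Because this only changes how the two models' outputs are combined at sampling time, an already fine-tuned model can use it without any retraining, matching the abstract's claim.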
☆ Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis
Conditional image synthesis is a crucial task with broad applications, such
as artistic creation and virtual reality. However, current generative methods
are often task-oriented with a narrow scope, handling a restricted condition
with constrained applicability. In this paper, we propose a novel approach that
treats conditional image synthesis as the modular combination of diverse
fundamental condition units. Specifically, we divide conditions into three
primary units: text, layout, and drag. To enable effective control over these
conditions, we design a dedicated alignment module for each. For the text
condition, we introduce a Dense Concept Alignment (DCA) module, which achieves
dense visual-text alignment by drawing on diverse textual concepts. For the
layout condition, we propose a Dense Geometry Alignment (DGA) module to enforce
comprehensive geometric constraints that preserve the spatial configuration.
For the drag condition, we introduce a Dense Motion Alignment (DMA) module to
apply multi-level motion regularization, ensuring that each pixel follows its
desired trajectory without visual artifacts. By flexibly inserting and
combining these alignment modules, our framework enhances the model's
adaptability to diverse conditional generation tasks and greatly expands its
application range. Extensive experiments demonstrate the superior performance
of our framework across a variety of conditions, including textual description,
segmentation mask (bounding box), drag manipulation, and their combinations.
Code is available at https://github.com/ZixuanWang0525/DADG.
☆ High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model
Recently single-view 3D generation via Gaussian splatting has emerged and
developed quickly. They learn 3D Gaussians from 2D RGB images generated from
pre-trained multi-view diffusion (MVD) models, and have shown a promising
avenue for 3D generation through a single image. Despite the current progress,
these methods still suffer from the inconsistency jointly caused by the
geometric ambiguity in the 2D images, and the lack of structure of 3D
Gaussians, leading to distorted and blurry 3D object generation. In this paper,
we propose to fix these issues by GS-RGBN, a new RGBN-volume Gaussian
Reconstruction Model designed to generate high-fidelity 3D objects from
single-view images. Our key insight is that a structured 3D representation can
simultaneously mitigate the two aforementioned issues. To this end, we propose
a novel hybrid Voxel-Gaussian representation, where a 3D voxel representation
contains explicit 3D geometric information, eliminating the geometric ambiguity
from 2D images. It also structures Gaussians during learning so that the
optimization tends to find better local optima. Our 3D voxel representation is
obtained by a fusion module that aligns RGB features and surface normal
features, both of which can be estimated from 2D images. Extensive experiments
demonstrate the superiority of our methods over prior works in terms of
high-quality reconstruction results, robust generalization, and good
efficiency.
comment: 12 pages
☆ Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment CVPR 2025
Capturing high-quality photographs under diverse real-world lighting
conditions is challenging, as both natural lighting (e.g., low-light) and
camera exposure settings (e.g., exposure time) significantly impact image
quality. This challenge becomes more pronounced in multi-view scenarios, where
variations in lighting and image signal processor (ISP) settings across
viewpoints introduce photometric inconsistencies. Such lighting degradations
and view-dependent variations pose substantial challenges to novel view
synthesis (NVS) frameworks based on Neural Radiance Fields (NeRF) and 3D
Gaussian Splatting (3DGS). To address this, we introduce Luminance-GS, a novel
approach to achieving high-quality novel view synthesis results under diverse
challenging lighting conditions using 3DGS. By adopting per-view color matrix
mapping and view-adaptive curve adjustments, Luminance-GS achieves
state-of-the-art (SOTA) results across various lighting conditions -- including
low-light, overexposure, and varying exposure -- while not altering the
original 3DGS explicit representation. Compared to previous NeRF- and
3DGS-based baselines, Luminance-GS provides real-time rendering speed with
improved reconstruction quality.
comment: CVPR 2025, project page:
https://cuiziteng.github.io/Luminance_GS_web/
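The per-view color matrix mapping and curve adjustment described above can be sketched as a 3x3 linear color transform followed by a monotonic tone curve. Both would be optimized per view in the actual method; the piecewise-linear curve here is an illustrative stand-in for the learned curve:

```python
import numpy as np

def view_adaptive_map(rgb, color_matrix, curve_knots):
    """Per-view photometric correction: apply a 3x3 color matrix, then a
    tone curve represented by uniformly spaced knots in [0, 1].
    rgb: (N, 3) colors in [0, 1]; curve_knots: (K,) curve values."""
    out = np.clip(rgb @ color_matrix.T, 0.0, 1.0)   # linear color mapping
    xs = np.linspace(0.0, 1.0, len(curve_knots))
    return np.interp(out, xs, curve_knots)          # channel-wise tone curve
```

Keeping the correction outside the 3DGS parameters, as in the abstract, means the underlying explicit representation stays untouched while each view gets its own photometric adjustment.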
☆ GarmageNet: A Dataset and Scalable Representation for Generic Garment Modeling
High-fidelity garment modeling remains challenging due to the lack of
large-scale, high-quality datasets and efficient representations capable of
handling non-watertight, multi-layer geometries. In this work, we introduce
Garmage, a neural-network-and-CG-friendly garment representation that
seamlessly encodes the accurate geometry and sewing pattern of complex
multi-layered garments as a structured set of per-panel geometry images. As a
dual-2D-3D representation, Garmage achieves an unprecedented integration of 2D
image-based algorithms with 3D modeling workflows, enabling high fidelity,
non-watertight, multi-layered garment geometries with direct compatibility for
industrial-grade simulations. Built upon this representation, we present
GarmageNet, a novel generation framework capable of producing detailed
multi-layered garments with body-conforming initial geometries and intricate
sewing patterns, based on user prompts or existing in-the-wild sewing patterns.
Furthermore, we introduce a robust stitching algorithm that recovers per-vertex
stitches, ensuring seamless integration into flexible simulation pipelines for
downstream editing of sewing patterns, material properties, and dynamic
simulations. Finally, we release an industrial-standard, large-scale,
high-fidelity garment dataset featuring detailed annotations, vertex-wise
correspondences, and a robust pipeline for converting unstructured production
sewing patterns into GarmageNet standard structural assets, paving the way for
large-scale, industrial-grade garment generation systems.
☆ Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction ICME 2025
Cross-modal 3D retrieval is a critical yet challenging task, aiming to
achieve bi-directional retrieval between 3D and text modalities. Current
methods predominantly rely on a certain 3D representation (e.g., point cloud),
with few exploiting the 2D-3D consistency and complementary relationships,
which constrains their performance. To bridge this gap, we propose to adopt
multi-view images and point clouds to jointly represent 3D shapes, facilitating
tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D
retrieval. Notably, we introduce tri-modal reconstruction to improve the
generalization ability of encoders. Given point features, we reconstruct image
features under the guidance of text features, and vice versa. With well-aligned
point cloud and multi-view image features, we aggregate them as multimodal
embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic
understanding. Recognizing the significant noise in current datasets where many
3D shapes and texts share similar semantics, we employ hard negative
contrastive training to emphasize harder negatives with greater significance,
leading to robust discriminative embeddings. Extensive experiments on the
Text2Shape dataset demonstrate that our method outperforms previous
state-of-the-art methods in both shape-to-text and text-to-shape retrieval
tasks by a substantial margin.
comment: ICME 2025
☆ ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Egocentric interaction perception is one of the essential branches in
investigating human-environment interaction, which lays the basis for
developing next-generation intelligent systems. However, existing egocentric
interaction understanding methods cannot yield coherent textual and pixel-level
responses simultaneously according to user queries, which lacks flexibility for
varying downstream application requirements. To comprehend egocentric
interactions exhaustively, this paper presents a novel task named Egocentric
Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image
with the query as input, Ego-IRG is the first task that aims to resolve the
interactions through three crucial steps: analyzing, answering, and pixel
grounding, which results in fluent textual and fine-grained pixel-level
responses. Another challenge is that existing datasets cannot meet the
conditions for the Ego-IRG task. To address this limitation, this paper creates
the Ego-IRGBench dataset based on extensive manual efforts, which includes over
20k egocentric images with 1.6 million queries and corresponding multimodal
responses about interactions. Moreover, we design a unified ANNEXE model to
generate text- and pixel-level outputs utilizing multimodal large language
models, which enables a comprehensive interpretation of egocentric
interactions. The experiments on the Ego-IRGBench exhibit the effectiveness of
our ANNEXE model compared with other works.
comment: Computer Vision and Pattern Recognition
☆ Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies
Deepfakes are AI-generated media in which the original content is digitally
altered to create convincing but manipulated images, videos, or audio. Among
the various types of deepfakes, lip-syncing deepfakes are one of the most
challenging deepfakes to detect. In these videos, a person's lip movements are
synthesized to match altered or entirely new audio using AI models. Therefore,
unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are
confined to the mouth region, making them more subtle and thus harder to
discern. In this paper, we propose LIPINC-V2, a novel detection framework that
leverages a combination of vision temporal transformer with multihead
cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal
inconsistencies in the mouth region. These inconsistencies appear across
adjacent frames and persist throughout the video. Our model can successfully
capture both short-term and long-term variations in mouth movement, enhancing
its ability to detect these inconsistencies. Additionally, we created a new
lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five
state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive
experiments on our proposed LipSyncTIMIT dataset and two other benchmark
deepfake datasets demonstrate that our model achieves state-of-the-art
performance. The code and the dataset are available at
https://github.com/skrantidatta/LIPINC-V2 .
☆ Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes CVPR 2025
Mesh saliency enhances the adaptability of 3D vision by identifying and
emphasizing regions that naturally attract visual attention. To investigate the
interaction between geometric structure and texture in shaping visual
attention, we establish a comprehensive mesh saliency dataset, which is the
first to systematically capture the differences in saliency distribution under
both textured and non-textured visual conditions. Furthermore, we introduce
mesh Mamba, a unified saliency prediction model based on a state space model
(SSM), designed to adapt across various mesh types. Mesh Mamba effectively
analyzes the geometric structure of the mesh while seamlessly incorporating
texture features into the topological framework, ensuring coherence throughout
appearance-enhanced modeling. More importantly, by subgraph embedding and a
bidirectional SSM, the model enables global context modeling for both local
geometry and texture, preserving the topological structure and improving the
understanding of visual details and structural complexity. Through extensive
theoretical and empirical validation, our model not only improves performance
across various mesh types but also demonstrates high scalability and
versatility, particularly through cross validations of various visual features.
comment: to be published in CVPR 2025
☆ Deep LG-Track: An Enhanced Localization-Confidence-Guided Multi-Object Tracker
Multi-object tracking plays a crucial role in various applications, such as
autonomous driving and security surveillance. This study introduces Deep
LG-Track, a novel multi-object tracker that incorporates three key enhancements
to improve the tracking accuracy and robustness. First, an adaptive Kalman
filter is developed to dynamically update the covariance of measurement noise
based on detection confidence and trajectory disappearance. Second, a novel
cost matrix is formulated to adaptively fuse motion and appearance information,
leveraging localization confidence and detection confidence as weighting
factors. Third, a dynamic appearance feature updating strategy is introduced,
adjusting the relative weighting of historical and current appearance features
based on appearance clarity and localization accuracy. Comprehensive
evaluations on the MOT17 and MOT20 datasets demonstrate that the proposed Deep
LG-Track consistently outperforms state-of-the-art trackers across multiple
performance metrics, highlighting its effectiveness in multi-object tracking
tasks.
comment: 11 pages, 6 figures
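The adaptive motion-appearance fusion described in the Deep LG-Track abstract can be sketched as a confidence-weighted blend of the two cost matrices. The product weighting below is a hypothetical choice standing in for the paper's formulation:

```python
import numpy as np

def fused_cost(c_motion, c_app, det_conf, loc_conf):
    """Adaptively fuse motion and appearance association costs.
    Each detection's appearance weight grows with its detection and
    localization confidence, so appearance is trusted more for
    confident, well-localized boxes.
    c_motion, c_app: (num_tracks, num_dets) costs in [0, 1];
    det_conf, loc_conf: (num_dets,) confidences in [0, 1]."""
    w = (det_conf * loc_conf)[None, :]               # per-detection weight
    return (1.0 - w) * c_motion + w * c_app
```

The fused matrix would then feed a standard assignment step (e.g. the Hungarian algorithm) to match tracks to detections.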
☆ BiSeg-SAM: Weakly-Supervised Post-Processing Framework for Boosting Binary Segmentation in Segment Anything Models
Accurate segmentation of polyps and skin lesions is essential for diagnosing
colorectal and skin cancers. While various segmentation methods for polyps and
skin lesions using fully supervised deep learning techniques have been
developed, the pixel-level annotation of medical images by doctors is both
time-consuming and costly. Foundational vision models like the Segment Anything
Model (SAM) have demonstrated superior performance; however, directly applying
SAM to medical segmentation may not yield satisfactory results due to the lack
of domain-specific medical knowledge. In this paper, we propose BiSeg-SAM, a
SAM-guided weakly supervised prompting and boundary refinement network for the
segmentation of polyps and skin lesions. Specifically, we fine-tune SAM
combined with a CNN module to learn local features. We introduce a WeakBox with
two functions: automatically generating box prompts for the SAM model and using
our proposed Multi-choice Mask-to-Box (MM2B) transformation for rough
mask-to-box conversion, addressing the mismatch between coarse labels and
precise predictions. Additionally, we apply scale consistency (SC) loss for
prediction scale alignment. Our DetailRefine module enhances boundary precision
and segmentation accuracy by refining coarse predictions using a limited amount
of ground truth labels. This comprehensive approach enables BiSeg-SAM to
achieve excellent multi-task segmentation performance. Our method demonstrates
significant superiority over state-of-the-art (SOTA) methods when tested on
five polyp datasets and one skin cancer dataset.
comment: 2024 IEEE International Conference on Bioinformatics and Biomedicine
(BIBM)
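The WeakBox idea of turning a coarse mask into a box prompt for SAM can be illustrated with a minimal sketch: take the extent of the nonzero mask region as the box. This is a simplification under assumptions; the paper's Multi-choice Mask-to-Box (MM2B) transformation is more involved, and the `mask_to_box` name and `margin` parameter are hypothetical.

```python
import numpy as np

def mask_to_box(mask, margin=0):
    """Convert a coarse binary mask into an (x0, y0, x1, y1) box prompt.

    A minimal sketch of rough mask-to-box conversion; the paper's
    Multi-choice Mask-to-Box (MM2B) transformation is more involved,
    and the `margin` parameter is a hypothetical addition.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("empty mask has no bounding box")
    h, w = mask.shape
    return (max(int(xs.min()) - margin, 0),
            max(int(ys.min()) - margin, 0),
            min(int(xs.max()) + margin, w - 1),
            min(int(ys.max()) + margin, h - 1))

coarse = np.zeros((8, 8), dtype=np.uint8)
coarse[2:5, 3:7] = 1  # rough lesion mask
print(mask_to_box(coarse))  # (3, 2, 6, 4)
```

A small margin can compensate for the mismatch between coarse labels and precise predictions that the abstract describes.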
☆ Multimodal Point Cloud Semantic Segmentation With Virtual Point Enhancement
LiDAR-based 3D point cloud recognition has been proven beneficial in various
applications. However, the sparsity and varying density pose a significant
challenge in capturing intricate details of objects, particularly for
medium-range and small targets. Therefore, we propose a multi-modal point cloud
semantic segmentation method based on Virtual Point Enhancement (VPE), which
integrates virtual points generated from images to address these issues. These
virtual points are dense but noisy, and directly incorporating them can
increase the computational burden and degrade performance. Hence, we introduce
a spatial difference-driven adaptive filtering module that selectively extracts
valuable pseudo points from these virtual points based on density and distance,
enhancing the density of medium-range targets. Subsequently, we propose a
noise-robust sparse feature encoder that incorporates noise-robust feature
extraction and fine-grained feature enhancement. Noise-robust feature
extraction exploits the 2D image space to reduce the impact of noisy points,
while fine-grained feature enhancement boosts sparse geometric features through
inner-voxel neighborhood point aggregation and downsampled voxel aggregation.
Results on SemanticKITTI and nuScenes, two large-scale benchmark datasets,
validate the effectiveness of the proposed method, which improves mIoU by
2.89\% on nuScenes with the introduction of only 7.7\% virtual points.
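The distance- and density-driven selection of pseudo points can be sketched with fixed thresholds, though the paper's spatial difference-driven module is adaptive; all threshold values and the function name below are hypothetical.

```python
import numpy as np

def filter_virtual_points(points, r_min=10.0, r_max=40.0,
                          density_radius=0.5, max_neighbors=8):
    """Select valuable pseudo points from dense, noisy virtual points.

    A fixed-threshold sketch of distance- and density-driven filtering;
    the paper's spatial difference-driven module is adaptive, and every
    threshold value here is an assumption.
    points: (N, 3) xyz coordinates in the sensor frame.
    """
    dist = np.linalg.norm(points, axis=1)
    in_range = (dist >= r_min) & (dist <= r_max)  # keep medium-range targets
    # Crude local density: neighbor count within density_radius.
    # O(N^2) pairwise distances; fine for a sketch, not for full scans.
    d2 = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    density = (d2 < density_radius).sum(axis=1) - 1  # exclude self
    return points[in_range & (density <= max_neighbors)]

pts = np.array([[20.0, 0.0, 0.0],   # medium range: kept
                [5.0, 0.0, 0.0],    # too close: dropped
                [50.0, 0.0, 0.0]])  # too far: dropped
print(filter_virtual_points(pts).shape)  # (1, 3)
```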
☆ MuTri: Multi-view Tri-alignment for OCT to OCTA 3D Image Translation
Optical coherence tomography angiography (OCTA) shows its great importance in
imaging microvascular networks by providing accurate 3D imaging of blood
vessels, but it relies upon specialized sensors and expensive devices. For this
reason, previous works show the potential to translate the readily available 3D
Optical Coherence Tomography (OCT) images into 3D OCTA images. However,
existing OCTA translation methods directly learn the mapping from the OCT
domain to the OCTA domain in a continuous and infinite space, with guidance
from only a single view, i.e., the OCTA projection map, leading to suboptimal
results. To this end, we propose a multi-view Tri-alignment framework for OCT
to OCTA 3D image translation in discrete and finite space, named MuTri. In the
first stage, we pre-train two vector-quantized variational auto-encoders
(VQ-VAEs) by reconstructing 3D OCT and 3D OCTA data, providing semantic priors for
subsequent multi-view guidances. In the second stage, our multi-view
tri-alignment facilitates another VQVAE model to learn the mapping from the OCT
domain to the OCTA domain in discrete and finite space. Specifically, a
contrastive-inspired semantic alignment is proposed to maximize the mutual
information with the pre-trained models from OCT and OCTA views, to facilitate
codebook learning. Meanwhile, a vessel structure alignment is proposed to
minimize the structure discrepancy with the pre-trained models from the OCTA
projection map view, benefiting from learning the detailed vessel structure
information. We also collect the first large-scale dataset, namely, OCTA2024,
which contains a pair of OCT and OCTA volumes from 846 subjects.
☆ TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Large video-language models (LVLMs) have shown remarkable performance across
various video-language tasks. However, they encounter significant challenges
when processing long videos because of the large number of video frames
involved. Downsampling long videos in either space or time can lead to visual
hallucinations, making it difficult to accurately interpret long videos.
Motivated by human hierarchical temporal search strategies, we propose
\textbf{TimeSearch}, a novel framework enabling LVLMs to understand long videos
in a human-like manner. TimeSearch integrates two human-like primitives into a
unified autoregressive LVLM: 1) \textbf{Spotlight} efficiently identifies
relevant temporal events through a Temporal-Augmented Frame Representation
(TAFR), explicitly binding visual features with timestamps; 2)
\textbf{Reflection} evaluates the correctness of the identified events,
leveraging the inherent temporal self-reflection capabilities of LVLMs.
TimeSearch progressively explores key events and prioritizes temporal search
based on reflection confidence. Extensive experiments on challenging long-video
benchmarks confirm that TimeSearch substantially surpasses previous
state-of-the-art, improving the accuracy from 41.8\% to 51.5\% on the LVBench.
Additionally, experiments on temporal grounding demonstrate that the TAFR alone
is sufficient to elicit the surprising temporal grounding ability of LVLMs in a
simple yet versatile manner, improving mIoU on
Charades-STA by 11.8\%. The code will be released.
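The core of the TAFR, binding visual features with timestamps, can be sketched as appending a normalized time channel to each frame feature. The actual TAFR operates at the token level inside the LVLM, so this channel-concatenation form is an assumption.

```python
import numpy as np

def tafr(frame_feats, timestamps, duration):
    """Temporal-Augmented Frame Representation (sketch): bind each
    frame's visual feature to its timestamp by appending a normalized
    time channel. The real TAFR binds timestamps at the token level
    inside the LVLM; this concatenation form is an assumption.

    frame_feats: (T, D); timestamps: (T,) in seconds.
    """
    t = (timestamps / duration)[:, None]  # normalize to [0, 1]
    return np.concatenate([frame_feats, t], axis=1)  # (T, D + 1)

feats = np.random.default_rng(0).standard_normal((4, 16))
out = tafr(feats, np.array([0.0, 10.0, 20.0, 30.0]), duration=30.0)
print(out.shape)  # (4, 17)
```

With timestamps made explicit, a search procedure can reason about when an event occurs rather than only which frame index it fell into.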
☆ Leveraging Generalizability of Image-to-Image Translation for Enhanced Adversarial Defense
In the rapidly evolving field of artificial intelligence, machine learning
emerges as a key technology characterized by its vast potential and inherent
risks. The stability and reliability of these models are important, as they are
frequent targets of security threats. Adversarial attacks, first rigorously
defined by Ian Goodfellow et al. in 2013, highlight a critical vulnerability:
they can trick machine learning models into making incorrect predictions by
applying nearly invisible perturbations to images. Although many studies have
focused on constructing sophisticated defensive mechanisms to mitigate such
attacks, they often overlook the substantial time and computational costs of
training and maintaining these models. Ideally, a defense method should be able
to generalize across various, even unseen, adversarial attacks with minimal
overhead. Building on our previous work on image-to-image translation-based
defenses, this study introduces an improved model that incorporates residual
blocks to enhance generalizability. The proposed method requires training only
a single model, effectively defends against diverse attack types, and is
readily transferable between different target models. Experiments show that our
model can restore the classification accuracy from near zero to an average of
72\% while maintaining competitive performance compared to state-of-the-art
methods.
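The generalization-enhancing residual blocks added to the translation network follow the standard identity-skip pattern y = x + F(x). The sketch below uses per-pixel linear maps instead of real convolutions to stay dependency-free, so the layer shapes and names are assumptions, not the paper's architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ResidualBlock:
    """Identity-skip residual block: y = x + F(x).

    A dependency-free sketch of the blocks added to the image-to-image
    defense network; per-pixel linear maps stand in for convolutions,
    and all shapes are assumptions. Input/output: (H, W, C).
    """

    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((channels, channels)) * 0.1
        self.w2 = rng.standard_normal((channels, channels)) * 0.1

    def __call__(self, x):
        h = relu(x @ self.w1)
        return x + h @ self.w2  # skip connection preserves the input signal

block = ResidualBlock(channels=8)
x = np.random.default_rng(1).standard_normal((16, 16, 8))
y = block(x)
print(y.shape)  # (16, 16, 8)
```

The skip connection is what lets such blocks refine an image without having to relearn the identity mapping, which plausibly helps a single defense model generalize across attack types.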
☆ All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning
Zheng Yang, Ruoxin Chen, Zhiyuan Yan, Ke-Yue Zhang, Xinghe Fu, Shuang Wu, Xiujun Shu, Taiping Yao, Junchi Yan, Shouhong Ding, Xi Li
The exponential growth of AI-generated images (AIGIs) underscores the urgent
need for robust and generalizable detection methods. In this paper, we
establish two key principles for AIGI detection through systematic analysis:
\textbf{(1) All Patches Matter:} Unlike conventional image classification where
discriminative features concentrate on object-centric regions, each patch in
AIGIs inherently contains synthetic artifacts due to the uniform generation
process, suggesting that every patch serves as an important artifact source for
detection. \textbf{(2) More Patches Better}: Leveraging distributed artifacts
across more patches improves detection robustness by capturing complementary
forensic evidence and reducing over-reliance on specific patches, thereby
enhancing robustness and generalization. However, our counterfactual analysis
reveals an undesirable phenomenon: naively trained detectors often exhibit a
\textbf{Few-Patch Bias}, discriminating between real and synthetic images based
on minority patches. We identify \textbf{Lazy Learner} as the root cause:
detectors preferentially learn conspicuous artifacts in limited patches while
neglecting broader artifact distributions. To address this bias, we propose the
\textbf{P}anoptic \textbf{P}atch \textbf{L}earning (PPL) framework, involving:
(1) Random Patch Replacement that randomly substitutes synthetic patches with
real counterparts to compel models to identify artifacts in underutilized
regions, encouraging the broader use of more patches; (2) Patch-wise
Contrastive Learning that enforces consistent discriminative capability across
all patches, ensuring uniform utilization of all patches. Extensive experiments
across two different settings on several benchmarks verify the effectiveness of
our approach.
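Random Patch Replacement, the first component of PPL, can be sketched directly: substitute a random subset of grid-aligned patches in a synthetic image with the co-located patches of a real image. The patch size and replacement ratio below are assumed hyper-parameters, not values from the paper.

```python
import numpy as np

def random_patch_replacement(fake_img, real_img, patch=8, ratio=0.3, rng=None):
    """Randomly substitute synthetic patches with real counterparts.

    A sketch of PPL's Random Patch Replacement under assumed
    hyper-parameters (patch size, replacement ratio); images are
    (H, W, C) arrays with H and W divisible by `patch`.
    """
    rng = rng or np.random.default_rng(0)
    out = fake_img.copy()
    h, w = fake_img.shape[:2]
    gh, gw = h // patch, w // patch
    n_replace = int(gh * gw * ratio)
    idx = rng.choice(gh * gw, size=n_replace, replace=False)
    for i in idx:
        r, c = divmod(i, gw)
        ys, xs = r * patch, c * patch
        out[ys:ys + patch, xs:xs + patch] = real_img[ys:ys + patch, xs:xs + patch]
    return out

fake = np.zeros((32, 32, 1))   # stand-in synthetic image
real = np.ones((32, 32, 1))    # stand-in real image
mixed = random_patch_replacement(fake, real)
print(mixed.mean())  # 0.25: four of the sixteen patches were replaced
```

Training on such mixed images forces the detector to find artifacts in whatever patches remain synthetic, counteracting the Few-Patch Bias.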
☆ DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data
Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising
performance in domain-specific data (e.g., biology), and has attracted
increasing research attention. Existing works generally focus on collecting
extensive domain-specific data and directly tuning the original CLIP models.
However, such a paradigm does not fully account for the characteristics of
domain-specific data (e.g., the fine-grained nature of biological data), which
limits model capability while largely sacrificing the original ability of
CLIP in the general domain. In this paper, we propose a Distribution
Alignment-based Language-Image Pre-Training (DALIP) method for biological data.
Specifically, DALIP optimizes CLIP models by matching the similarity between
the feature distributions of image-text pairs instead of only the original
[cls] tokens; these distributions capture rich yet compact information inherent
in image-text pairs as powerful representations, and so better cope with the
fine-grained nature of biological data. Particularly, DALIP efficiently approximates each feature
distribution via its first- and second-order statistics, while presenting a
Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order
statistics of token features efficiently. Furthermore, we collect a new
plant-domain dataset (i.e., a specific subset of the biological domain), named
PlantMix-13M, comprising 10M plant samples mixed with 3M general-domain samples
according to data mixing laws. Extensive experiments show that DALIP clearly outperforms existing
CLIP counterparts in biological domain, while well generalizing to remote
sensing and medical imaging domains. Besides, our PlantMix-13M dataset further
boosts performance of DALIP in plant domain, while preserving model ability in
general domain.
comment: 14 pages
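The first- and second-order statistics DALIP matches are the mean and covariance of token features. The sketch below uses a plain covariance and a hypothetical similarity; the paper's MBDC module instead computes second-order statistics via Brownian distance covariance, which this only stands in for.

```python
import numpy as np

def feature_distribution(tokens):
    """First- and second-order statistics (mean and covariance) of
    token features, the quantities DALIP matches in place of a single
    [cls] token. tokens: (N, D)."""
    mu = tokens.mean(axis=0)
    centered = tokens - mu
    sigma = centered.T @ centered / max(len(tokens) - 1, 1)
    return mu, sigma

def distribution_similarity(dist_a, dist_b):
    """Hypothetical (mu, sigma) similarity; DALIP's MBDC module derives
    second-order statistics via Brownian distance covariance, for which
    this plain covariance comparison is only a stand-in."""
    mu_a, sig_a = dist_a
    mu_b, sig_b = dist_b
    return (-np.linalg.norm(mu_a - mu_b) ** 2
            - np.linalg.norm(sig_a - sig_b, ord="fro") ** 2)
```

Matching whole distributions rather than single pooled tokens is what lets the objective see fine-grained, token-level structure in biological images.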
☆ v-CLR: View-Consistent Learning for Open-World Instance Segmentation CVPR 2025
In this paper, we address the challenging problem of open-world instance
segmentation. Existing works have shown that vanilla visual networks are biased
toward learning appearance information, e.g., texture, to recognize objects. This
implicit bias causes the model to fail in detecting novel objects with unseen
textures in the open-world setting. To address this challenge, we propose a
learning framework, called view-Consistent LeaRning (v-CLR), which aims to
enforce the model to learn appearance-invariant representations for robust
instance segmentation. In v-CLR, we first introduce additional views for each
image, where the texture undergoes significant alterations while preserving the
image's underlying structure. We then encourage the model to learn the
appearance-invariant representation by enforcing the consistency between object
features across different views, for which we obtain class-agnostic object
proposals using off-the-shelf unsupervised models that possess strong
object-awareness. These proposals enable cross-view object feature matching,
greatly reducing the appearance dependency while enhancing the
object-awareness. We thoroughly evaluate our method on public benchmarks under
both cross-class and cross-dataset settings, achieving state-of-the-art
performance. Project page: https://visual-ai.github.io/vclr
comment: Accepted by CVPR 2025, Project page:
https://visual-ai.github.io/vclr, Code: https://github.com/Visual-AI/vCLR
☆ 3D Gaussian Inverse Rendering with Approximated Global Illumination
3D Gaussian Splatting shows great potential in reconstructing photo-realistic
3D scenes. However, these methods typically bake illumination into their
representations, limiting their use for physically-based rendering and scene
editing. Although recent inverse rendering approaches aim to decompose scenes
into material and lighting components, they often rely on simplifying
assumptions that fail when editing. We present a novel approach that enables
efficient global illumination for 3D Gaussian Splatting through screen-space
ray tracing. Our key insight is that a substantial amount of indirect light can
be traced back to surfaces visible within the current view frustum. Leveraging
this observation, we augment the direct shading computed by 3D Gaussians with
Monte-Carlo screen-space ray-tracing to capture one-bounce indirect
illumination. In this way, our method enables realistic global illumination
without sacrificing the computational efficiency and editability benefits of 3D
Gaussians. Through experiments, we show that the screen-space approximation we
utilize allows for indirect illumination and supports real-time rendering and
editing. Code, data, and models will be made available at our project page:
https://wuzirui.github.io/gs-ssr.
☆ Prompt-Guided Attention Head Selection for Focus-Oriented Image Retrieval CVPR 2025
The goal of this paper is to enhance pretrained Vision Transformer (ViT)
models for focus-oriented image retrieval with visual prompting. In real-world
image retrieval scenarios, both query and database images often exhibit
complexity, with multiple objects and intricate backgrounds. Users often want
to retrieve images containing a specific object, a task we define as the Focus-Oriented
Image Retrieval (FOIR) task. While a standard image encoder can be employed to
extract image features for similarity matching, it may not perform optimally in
the multi-object-based FOIR task. This is because each image is represented by
a single global feature vector. To overcome this, a prompt-based image
retrieval solution is required. We propose an approach called Prompt-guided
attention Head Selection (PHS) to leverage the head-wise potential of the
multi-head attention mechanism in ViT in a promptable manner. PHS selects
specific attention heads by matching their attention maps with the user's visual
prompts, such as a point, box, or segmentation. This empowers the model to
focus on a specific object of interest while preserving the surrounding visual
context. Notably, PHS does not necessitate model re-training and avoids any
image alteration. Experimental results show that PHS substantially improves
performance on multiple datasets, offering a practical and training-free
solution to enhance model performance in the FOIR task.
comment: Accepted to CVPR 2025 PixFoundation Workshop
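Head selection by prompt matching can be sketched as scoring each head by the fraction of its attention mass that falls inside the prompt region (a point, box, or segment rasterized to a mask) and keeping the top-k. The scoring rule and function name are assumptions about the mechanism, not PHS's exact formulation.

```python
import numpy as np

def select_heads(attn_maps, prompt_mask, k=2):
    """Prompt-guided attention head selection (sketch).

    Scores each ViT head by how much of its attention falls inside the
    user's visual prompt and keeps the top-k heads; the scoring rule is
    an assumption. No retraining or image alteration is involved.

    attn_maps: (H, h, w) per-head attention of the [cls] token.
    prompt_mask: (h, w) binary mask of the prompt region.
    """
    scores = (attn_maps * prompt_mask).sum(axis=(1, 2)) / attn_maps.sum(axis=(1, 2))
    return np.argsort(scores)[::-1][:k]

mask = np.zeros((4, 4))
mask[:2, :2] = 1  # user's box prompt rasterized to the token grid
attn = np.stack([
    np.ones((4, 4)),  # head 0: uniform attention
    mask.copy(),      # head 1: attends only inside the prompt
    1.0 - mask,       # head 2: attends only outside the prompt
])
print(select_heads(attn, mask))  # [1 0]
```

Only the selected heads then contribute to the retrieval feature, steering similarity toward the prompted object while the unselected context is retained elsewhere.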
☆ Slow-Fast Architecture for Video Multi-Modal Large Language Models
Min Shi, Shihao Wang, Chieh-Yun Chen, Jitesh Jain, Kai Wang, Junjun Xiong, Guilin Liu, Zhiding Yu, Humphrey Shi
Balancing temporal resolution and spatial detail under limited compute budget
remains a key challenge for video-based multi-modal large language models
(MLLMs). Existing methods typically compress video representations using
predefined rules before feeding them into the LLM, resulting in irreversible
information loss and often ignoring input instructions. To address this, we
propose a novel slow-fast architecture that naturally circumvents this
trade-off, enabling the use of more input frames while preserving spatial
details. Inspired by how humans first skim a video before focusing on relevant
parts, our slow-fast design employs a dual-token strategy: 1) "fast" visual
tokens -- a compact set of compressed video features -- are fed into the LLM
alongside text embeddings to provide a quick overview; 2) "slow" visual tokens
-- uncompressed video features -- are cross-attended by text embeddings through
specially designed hybrid decoder layers, enabling instruction-aware extraction
of relevant visual details with linear complexity. We conduct systematic
exploration to optimize both the overall architecture and key components.
Experiments show that our model significantly outperforms self-attention-only
baselines, extending the input capacity from 16 to 128 frames with just a 3%
increase in computation, and achieving a 16% average performance improvement
across five video understanding benchmarks. Our 7B model achieves
state-of-the-art performance among models of similar size. Furthermore, our
slow-fast architecture is a plug-and-play design that can be integrated into
other video MLLMs to improve efficiency and scalability.
comment: Technical report
☆ CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection
Cross-layer feature pyramid networks (CFPNs) have achieved notable progress
in multi-scale feature fusion and boundary detail preservation for salient
object detection. However, traditional CFPNs still suffer from two core
limitations: (1) a computational bottleneck caused by complex feature weighting
operations, and (2) degraded boundary accuracy due to feature blurring in the
upsampling process. To address these challenges, we propose CFMD, a novel
cross-layer feature pyramid network that introduces two key innovations. First,
we design a context-aware feature aggregation module (CFLMA), which
incorporates the state-of-the-art Mamba architecture to construct a dynamic
weight distribution mechanism. This module adaptively adjusts feature
importance based on image context, significantly improving both representation
efficiency and generalization. Second, we introduce an adaptive dynamic
upsampling unit (CFLMD) that preserves spatial details during resolution
recovery. By adjusting the upsampling range dynamically and initializing with a
bilinear strategy, the module effectively reduces feature overlap and maintains
fine-grained boundary structures. Extensive experiments on three standard
benchmarks using three mainstream backbone networks demonstrate that CFMD
achieves substantial improvements in pixel-level accuracy and boundary
segmentation quality, especially in complex scenes. The results validate the
effectiveness of CFMD in jointly enhancing computational efficiency and
segmentation performance, highlighting its strong potential in salient object
detection tasks.
☆ On Data Synthesis and Post-training for Visual Abstract Reasoning
This paper is a pioneering work attempting to address abstract visual
reasoning (AVR) problems for large vision-language models (VLMs). We make a
common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific
AVR problems, surpassing both open-sourced (e.g., Qwen-2-VL-72B) and
closed-sourced powerful VLMs (e.g., GPT-4o) by a significant margin. This is a
great breakthrough since almost all previous VLMs fail or show nearly random
performance on representative AVR benchmarks. Our key success lies in our
innovative data synthesis and post-training process, designed to gradually
reduce task difficulty and guide the model to learn step by step. Our 7B model
is also shown to behave well on AVR without sacrificing common multimodal
comprehension abilities. We hope our paper could serve as an early effort in
this area and would inspire further research in abstract visual reasoning.
☆ COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking
Transformer has recently demonstrated great potential in improving
vision-language (VL) tracking algorithms. However, most of the existing VL
trackers rely on carefully designed mechanisms to perform the multi-stage
multi-modal fusion. Additionally, direct multi-modal fusion without alignment
ignores distribution discrepancy between modalities in feature space,
potentially leading to suboptimal representations. In this work, we propose
COST, a contrastive one-stage transformer fusion framework for VL tracking,
aiming to learn semantically consistent and unified VL representations.
Specifically, we introduce a contrastive alignment strategy that maximizes
mutual information (MI) between a video and its corresponding language
description. This enables effective cross-modal alignment, yielding
semantically consistent features in the representation space. By leveraging a
visual-linguistic transformer, we establish an efficient multi-modal fusion and
reasoning mechanism, empirically demonstrating that a simple stack of
transformer encoders effectively enables unified VL representations. Moreover,
we contribute a newly collected VL tracking benchmark dataset for small object
tracking, named VL-SOT500, with bounding boxes and language descriptions. Our
dataset comprises two challenging subsets, VL-SOT230 and VL-SOT270, dedicated
to evaluating generic and high-speed small object tracking, respectively. Small
object tracking is notoriously challenging due to weak appearance and limited
features, and this dataset is, to the best of our knowledge, the first to
explore the use of language cues to enhance visual representations for small
object tracking. Extensive experiments demonstrate that COST achieves
state-of-the-art performance on five existing VL tracking datasets, as well as
on our proposed VL-SOT500 dataset. Source codes and dataset will be made
publicly available.
comment: Preprint submitted to Elsevier.
https://github.com/983632847/Awesome-Multimodal-Object-Tracking
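Maximizing mutual information between a video and its language description is conventionally done with a contrastive bound such as InfoNCE. The sketch below shows a symmetric InfoNCE over matched video-text embedding pairs; the temperature value and symmetric formulation are assumptions, not COST's exact objective.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss, a standard lower bound on mutual
    information, sketching contrastive video-language alignment. The
    temperature and symmetric form are assumptions.

    video_emb, text_emb: (B, D); row i of each forms a matched pair.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) pairwise similarities
    labels = np.arange(len(v))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # video-to-text and text-to-video directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 32))
aligned = info_nce(emb, emb)
shuffled = info_nce(emb, np.roll(emb, 1, axis=0))
print(aligned < shuffled)  # True: matched pairs yield the lower loss
```

Aligning the two modalities in a shared space before fusion is what lets a plain stack of transformer encoders produce the unified representations the abstract describes.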
☆ Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Vision-Language Models (VLMs) extend the capabilities of Large Language
Models (LLMs) by incorporating visual information, yet they remain vulnerable
to jailbreak attacks, especially when processing noisy or corrupted images.
Although existing VLMs adopt security measures during training to mitigate such
attacks, vulnerabilities associated with noise-augmented visual inputs are
overlooked. In this work, we identify that missing noise-augmented training
causes critical security gaps: many VLMs are susceptible to even simple
perturbations such as Gaussian noise. To address this challenge, we propose
Robust-VLGuard, a multimodal safety dataset with aligned/misaligned image-text
pairs, combined with noise-augmented fine-tuning that reduces attack success
rates while preserving VLM functionality. For stronger
optimization-based visual perturbation attacks, we propose DiffPure-VLM,
leveraging diffusion models to convert adversarial perturbations into
Gaussian-like noise, which can be defended by VLMs with noise-augmented safety
fine-tuning. Experimental results demonstrate that the distribution-shifting
property of the diffusion model aligns well with our fine-tuned VLMs, significantly
mitigating adversarial perturbations across varying intensities. The dataset
and code are available at https://github.com/JarvisUSTC/DiffPure-RobustVLM.
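The conversion of adversarial perturbations into Gaussian-like noise rests on forward diffusion: blending the image with Gaussian noise drowns out structured perturbations, after which a diffusion model denoises and a noise-robust VLM can cope with the residue. Only the noising half is sketched below, and the linear schedule alpha_bar = 1 - t is an assumption.

```python
import numpy as np

def diffuse(x, t, rng=None):
    """Forward-diffusion noising used for purification (sketch).

    Blending the image with Gaussian noise overwhelms structured
    adversarial perturbations with Gaussian-like noise; a diffusion
    model then denoises. alpha_bar = 1 - t is an assumed schedule.
    """
    rng = rng or np.random.default_rng(0)
    alpha_bar = 1.0 - t  # assumed linear schedule, t in [0, 1)
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps

img = np.ones((4, 4))
print(np.allclose(diffuse(img, 0.0), img))  # True: t=0 leaves x intact
```

Larger t destroys more of the perturbation but also more image content, which is why the noise level must be matched to what the fine-tuned VLM tolerates.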
☆ Direction-Aware Hybrid Representation Learning for 3D Hand Pose and Shape Estimation CVPR 2025
Most model-based 3D hand pose and shape estimation methods directly regress
the parametric model parameters from an image to obtain 3D joints under weak
supervision. However, these methods involve solving a complex optimization
problem with many local minima, making training difficult. To address this
challenge, we propose learning direction-aware hybrid features (DaHyF) that
fuse implicit image features and explicit 2D joint coordinate features. This
fusion is enhanced by the pixel direction information in the camera coordinate
system to estimate pose, shape, and camera viewpoint. Our method directly
predicts 3D hand poses with DaHyF representation and reduces jittering during
motion capture using prediction confidence based on contrastive learning. We
evaluate our method on the FreiHAND dataset and show that it outperforms
existing state-of-the-art methods by more than 33% in accuracy. DaHyF also
achieves the top ranking on both the HO3Dv2 and HO3Dv3 leaderboards for the
metric of Mean Joint Error (after scale and translation alignment). Compared to
the second-best results, the largest improvement observed is 10%. We also
demonstrate its effectiveness in real-time motion capture scenarios with hand
position variability, occlusion, and motion blur.
comment: Accepted to CVPR 2025 workshop
☆ BOLDSimNet: Examining Brain Network Similarity between Task and Resting-State fMRI
Traditional causal connectivity methods in task-based and resting-state
functional magnetic resonance imaging (fMRI) face challenges in accurately
capturing directed information flow due to their sensitivity to noise and
inability to model multivariate dependencies. These limitations hinder the
effective comparison of brain networks between cognitive states, making it
difficult to analyze network reconfiguration during task and resting states. To
address these issues, we propose BOLDSimNet, a novel framework utilizing
Multivariate Transfer Entropy (MTE) to measure causal connectivity and network
similarity across different cognitive states. Our method groups functionally
similar regions of interest (ROIs) rather than spatially adjacent nodes,
improving accuracy in network alignment. We applied BOLDSimNet to fMRI data
from 40 healthy controls and found that children exhibited higher similarity
scores between task and resting states compared to adolescents, indicating
reduced variability in attention shifts. In contrast, adolescents showed more
differences between task and resting states in the Dorsal Attention Network
(DAN) and the Default Mode Network (DMN), reflecting enhanced network
adaptability. These findings emphasize developmental variations in the
reconfiguration of the causal brain network, showcasing BOLDSimNet's ability to
quantify network similarity and identify attentional fluctuations between
different cognitive states.
☆ ForestVO: Enhancing Visual Odometry in Forest Environments through ForestGlue
Recent advancements in visual odometry systems have improved autonomous
navigation; however, challenges persist in complex environments like forests,
where dense foliage, variable lighting, and repetitive textures compromise
feature correspondence accuracy. To address these challenges, we introduce
ForestGlue, enhancing the SuperPoint feature detector through four
configurations - grayscale, RGB, RGB-D, and stereo-vision - optimised for
various sensing modalities. For feature matching, we employ LightGlue or
SuperGlue, retrained with synthetic forest data. ForestGlue achieves comparable
pose estimation accuracy to baseline models but requires only 512 keypoints -
just 25% of the baseline's 2048 - to reach an LO-RANSAC AUC score of 0.745 at a
10° threshold. With only a quarter of the keypoints needed, ForestGlue
significantly reduces computational overhead, demonstrating effectiveness in
dynamic forest environments, and making it suitable for real-time deployment on
resource-constrained platforms. By combining ForestGlue with a
transformer-based pose estimation model, we propose ForestVO, which estimates
relative camera poses using matched 2D pixel coordinates between frames. On
challenging TartanAir forest sequences, ForestVO achieves an average relative
pose error (RPE) of 1.09 m and a kitti_score of 2.33%, outperforming
direct-based methods like DSO by 40% in dynamic scenes. Despite using only 10%
of the dataset for training, ForestVO maintains competitive performance with
TartanVO while being a significantly lighter model. This work establishes an
end-to-end deep learning pipeline specifically tailored for visual odometry in
forested environments, leveraging forest-specific training data to optimise
feature correspondence and pose estimation, thereby enhancing the accuracy and
robustness of autonomous navigation systems.
comment: Accepted to the IEEE Robotics and Automation Letters
♻ ☆ Mr. DETR: Instructive Multi-Route Training for Detection Transformers CVPR 2025
Existing methods enhance the training of detection transformers by
incorporating an auxiliary one-to-many assignment. In this work, we treat the
model as a multi-task framework, simultaneously performing one-to-one and
one-to-many predictions. We investigate the roles of each component in the
transformer decoder across these two training targets, including
self-attention, cross-attention, and feed-forward network. Our empirical
results demonstrate that any independent component in the decoder can
effectively learn both targets simultaneously, even when other components are
shared. This finding leads us to propose a multi-route training mechanism,
featuring a primary route for one-to-one prediction and two auxiliary training
routes for one-to-many prediction. We enhance the training mechanism with a
novel instructive self-attention that dynamically and flexibly guides object
queries for one-to-many prediction. The auxiliary routes are removed during
inference, ensuring no impact on model architecture or inference cost. We
conduct extensive experiments on various baselines, achieving consistent
improvements as shown in Figure 1. Project page:
https://visual-ai.github.io/mrdetr
comment: Accepted by CVPR 2025, Project page:
https://visual-ai.github.io/mrdetr, Code:
https://github.com/Visual-AI/Mr.DETR
♻ ☆ Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
NVIDIA, :, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Alice Luo, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Xiaodong Yang, Zhuolin Yang, Xiaohui Zeng, Zhe Zhang
Physical AI systems need to perceive, understand, and perform complex actions
in the physical world. In this paper, we present the Cosmos-Reason1 models that
can understand the physical world and generate appropriate embodied decisions
(e.g., next step action) in natural language through long chain-of-thought
reasoning processes. We begin by defining key capabilities for Physical AI
reasoning, with a focus on physical common sense and embodied reasoning. To
represent physical common sense, we use a hierarchical ontology that captures
fundamental knowledge about space, time, and physics. For embodied reasoning,
we rely on a two-dimensional ontology that generalizes across different
physical embodiments. Building on these capabilities, we develop two multimodal
large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data
and train our models in four stages: vision pre-training, general supervised
fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL)
as the post-training. To evaluate our models, we build comprehensive benchmarks
for physical common sense and embodied reasoning according to our ontologies.
Evaluation results show that Physical AI SFT and reinforcement learning bring
significant improvements. To facilitate the development of Physical AI, we will
make our code and pre-trained models available under the NVIDIA Open Model
License at https://github.com/nvidia-cosmos/cosmos-reason1.
♻ ☆ Meta ControlNet: Enhancing Task Adaptation via Meta Learning
Diffusion-based image synthesis has attracted extensive attention recently.
In particular, ControlNet that uses image-based prompts exhibits powerful
capability in image tasks such as Canny edge detection and generates images
well aligned with these prompts. However, vanilla ControlNet generally requires
extensive training of around 5000 steps to achieve a desirable control for a
single task. Recent context-learning approaches have improved its adaptability,
but mainly for edge-based tasks, and rely on paired examples. Thus, two
important open issues are yet to be addressed to reach the full potential of
ControlNet: (i) zero-shot control for certain tasks and (ii) faster adaptation
for non-edge-based tasks. In this paper, we introduce a novel Meta ControlNet
method, which adopts the task-agnostic meta learning technique and features a
new layer freezing design. Meta ControlNet significantly reduces learning steps
to attain control ability from 5000 to 1000. Further, Meta ControlNet exhibits
direct zero-shot adaptability in edge-based tasks without any finetuning, and
achieves control within only 100 finetuning steps in more complex non-edge
tasks such as Human Pose, outperforming all existing methods. The code is
available at https://github.com/JunjieYang97/Meta-ControlNet.
comment: Codebase link: https://github.com/JunjieYang97/Meta-ControlNet
♻ ☆ DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling
Recent advances in text-to-3D creation integrate the potent prior of
Diffusion Models from text-to-image generation into 3D domain. Nevertheless,
generating 3D scenes with multiple objects remains challenging. Therefore, we
present DreamScape, a method for generating 3D scenes from text. Utilizing
Gaussian Splatting for 3D representation, DreamScape introduces 3D Gaussian
Guide that encodes semantic primitives, spatial transformations and
relationships from text using LLMs, enabling local-to-global optimization.
Progressive scale control is tailored during local object generation,
addressing the training instability issue arising from simple blending in the
global optimization stage. Collision relationships between objects are modeled
at the global level to mitigate biases in LLMs priors, ensuring physical
correctness. Additionally, to generate pervasive objects like rain and snow
distributed extensively across the scene, we design a specialized sparse
initialization and densification strategy. Experiments demonstrate that
DreamScape achieves state-of-the-art performance, enabling high-fidelity,
controllable 3D scene generation.
♻ ☆ EVOS: Efficient Implicit Neural Training via EVOlutionary Selector CVPR 2025
We propose EVOlutionary Selector (EVOS), an efficient training paradigm for
accelerating Implicit Neural Representation (INR). Unlike conventional INR
training that feeds all samples through the neural network in each iteration,
our approach restricts training to strategically selected points, reducing
computational overhead by eliminating redundant forward passes. Specifically,
we treat each sample as an individual in an evolutionary process, where only
the fittest ones survive and merit inclusion in training, adaptively evolving
with the neural network dynamics. While this is conceptually similar to
Evolutionary Algorithms, their distinct objectives (selection for acceleration
vs. iterative solution optimization) require a fundamental redefinition of
evolutionary mechanisms for our context. In response, we design sparse fitness
evaluation, frequency-guided crossover, and augmented unbiased mutation to
comprise EVOS. These components respectively guide sample selection with
reduced computational cost, enhance performance through frequency-domain
balance, and mitigate selection bias from cached evaluation. Extensive
experiments demonstrate that our method achieves approximately 48%-66%
reduction in training time while ensuring superior convergence without
additional cost, establishing state-of-the-art acceleration among recent
sampling-based strategies.
comment: Accepted by CVPR 2025
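A minimal sketch of this survival-of-the-fittest sample selection, using per-sample loss as the fitness measure; the loss-based fitness, survival rate, and mutation rate are illustrative assumptions, not EVOS's exact sparse-fitness, crossover, and mutation design:

```python
import numpy as np

def evolutionary_select(per_sample_loss, survival_rate=0.5, mutation_rate=0.25, rng=None):
    """Keep the 'fittest' (highest-loss) samples for the next training
    iteration, with a small randomly mutated fraction to reduce the
    selection bias that cached (stale) fitness values would introduce."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(per_sample_loss)
    k = int(n * survival_rate)            # survivors per iteration
    n_mut = int(k * mutation_rate)        # slots filled by random mutation
    ranked = np.argsort(per_sample_loss)[::-1]   # hardest samples first
    elite = ranked[: k - n_mut]
    rest = ranked[k - n_mut:]
    mutants = rng.choice(rest, size=n_mut, replace=False)  # unbiased refresh
    return np.concatenate([elite, mutants])

losses = np.array([0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6])
sel = evolutionary_select(losses)
print(sorted(sel.tolist()))
```

Only the selected indices are fed forward, so the redundant forward passes the abstract mentions are skipped.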
♻ ☆ SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation ICLR 2025
In recent years, the development of diffusion models has led to significant
progress in image and video generation tasks, with pre-trained models like the
Stable Diffusion series playing a crucial role. Inspired by model pruning which
lightens large pre-trained models by removing unimportant parameters, we
propose a novel model fine-tuning method to make full use of these ineffective
parameters and enable the pre-trained model with new task-specified
capabilities. In this work, we first investigate the importance of parameters
in pre-trained diffusion models, and discover that the smallest 10% to 20% of
parameters by absolute values do not contribute to the generation process.
Based on this observation, we propose a method termed SaRA that re-utilizes
these temporarily ineffective parameters, equating to optimizing a sparse
weight matrix to learn the task-specific knowledge. To mitigate overfitting, we
propose a nuclear-norm-based low-rank sparse training scheme for efficient
fine-tuning. Furthermore, we design a new progressive parameter adjustment
strategy to make full use of the re-trained/finetuned parameters. Finally, we
propose a novel unstructural backpropagation strategy, which significantly
reduces memory costs during fine-tuning. Our method enhances the generative
capabilities of pre-trained models in downstream applications and outperforms
traditional fine-tuning methods like LoRA in maintaining the model's generalization
ability. We validate our approach through fine-tuning experiments on SD models,
demonstrating significant improvements. SaRA also offers the practical
advantage of requiring only a single line of code modification for efficient
implementation, and is seamlessly compatible with existing methods.
comment: Accepted by ICLR 2025
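The core observation, that the smallest 10% to 20% of weights by magnitude are temporarily ineffective, suggests a simple sparse fine-tuning mask. The sketch below is a hedged illustration of that idea, not SaRA's nuclear-norm low-rank training scheme or its unstructural backpropagation:

```python
import numpy as np

def sara_mask(weight, fraction=0.1):
    """Boolean mask over the smallest-magnitude `fraction` of parameters,
    i.e. the 'temporarily ineffective' weights that would be re-utilized."""
    flat = np.abs(weight).ravel()
    k = max(1, int(flat.size * fraction))
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return np.abs(weight) <= threshold

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))     # a stand-in pretrained weight matrix
mask = sara_mask(w, fraction=0.2)
grad = 2 * w                          # gradient of the toy loss sum(w**2)
sparse_update = grad * mask           # frozen weights receive zero gradient
print(mask.sum())                     # ~20% of the 4096 entries
```

Only the masked entries are updated during fine-tuning, which is what makes the optimized weight matrix sparse.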
♻ ☆ Enhancing Implicit Neural Representations via Symmetric Power Transformation AAAI 2025
We propose symmetric power transformation to enhance the capacity of Implicit
Neural Representation (INR) from the perspective of data transformation. Unlike
prior work utilizing random permutation or index rearrangement, our method
features a reversible operation that does not require additional storage
consumption. Specifically, we first investigate the characteristics of data
that can benefit the training of INR, proposing the Range-Defined Symmetric
Hypothesis, which posits that specific range and symmetry can improve the
expressive ability of INR. Based on this hypothesis, we propose a nonlinear
symmetric power transformation to achieve both range-defined and symmetric
properties simultaneously. We use the power coefficient to redistribute data to
approximate symmetry within the target range. To improve the robustness of the
transformation, we further design deviation-aware calibration and adaptive soft
boundary to address issues of extreme deviation boosting and continuity
breaking. Extensive experiments are conducted to verify the performance of the
proposed method, demonstrating that our transformation can reliably improve INR
compared with other data transformations. We also conduct 1D audio, 2D image
and 3D video fitting tasks to demonstrate the effectiveness and applicability
of our method.
comment: Accepted by AAAI 2025
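The range-defined, sign-preserving power transform can be sketched as follows; the normalization to [-1, 1] and the power rule follow the description above, while the deviation-aware calibration and adaptive soft boundary are omitted:

```python
import numpy as np

def symmetric_power_transform(x, p):
    """Map data to [-1, 1] (range-defined property) and apply a
    sign-preserving power to redistribute values symmetrically about
    zero. A sketch of the core idea only."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    z = 2 * (x - x_min) / (x_max - x_min) - 1   # normalize to [-1, 1]
    return np.sign(z) * np.abs(z) ** p          # symmetric redistribution

y = symmetric_power_transform([0, 64, 128, 192, 255], p=0.5)
print(np.round(y, 3))
```

The operation is reversible without extra storage, matching the abstract's claim: applying the power 1/p and undoing the normalization recovers the original data.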
♻ ☆ Target-Aware Video Diffusion Models
We present a target-aware video diffusion model that generates videos from an
input image in which an actor interacts with a specified target while
performing a desired action. The target is defined by a segmentation mask and
the desired action is described via a text prompt. Unlike existing controllable
image-to-video diffusion models that often rely on dense structural or motion
cues to guide the actor's movements toward the target, our target-aware model
requires only a simple mask to indicate the target, leveraging the
generalization capabilities of pretrained models to produce plausible actions.
This makes our method particularly effective for human-object interaction (HOI)
scenarios, where providing precise action guidance is challenging, and further
enables the use of video diffusion models for high-level action planning in
applications such as robotics. We build our target-aware model by extending a
baseline model to incorporate the target mask as an additional input. To
enforce target awareness, we introduce a special token that encodes the
target's spatial information within the text prompt. We then fine-tune the
model with our curated dataset using a novel cross-attention loss that aligns
the cross-attention maps associated with this token with the input target mask.
To further improve performance, we selectively apply this loss to the most
semantically relevant transformer blocks and attention regions. Experimental
results show that our target-aware model outperforms existing solutions in
generating videos where actors interact accurately with the specified targets.
We further demonstrate its efficacy in two downstream applications: video
content creation and zero-shot 3D HOI motion synthesis.
comment: The project page is available at https://taeksuu.github.io/tavid/
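The cross-attention loss idea, aligning the special token's attention map with the target mask, might look like the sketch below. The paper's exact loss form is not given in the abstract, so the normalized L2 between the two distributions is an assumption:

```python
import numpy as np

def cross_attention_alignment_loss(attn_map, target_mask, eps=1e-8):
    """Encourage the target token's cross-attention map to match the
    target mask, treating both as spatial distributions. The specific
    normalized-L2 form here is an illustrative assumption."""
    a = attn_map / (attn_map.sum() + eps)        # attention as a distribution
    m = target_mask / (target_mask.sum() + eps)  # mask as a distribution
    return float(((a - m) ** 2).sum())

attn = np.ones((4, 4)) / 16                # uniform (unfocused) attention
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1  # 2x2 target region
print(cross_attention_alignment_loss(attn, mask))
```

Attention concentrated exactly on the masked region drives the loss to zero, which is the alignment the fine-tuning objective rewards.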
♻ ☆ Denoising Functional Maps: Diffusion Models for Shape Correspondence CVPR 2025
Estimating correspondences between pairs of deformable shapes remains a
challenging problem. Despite substantial progress, existing methods lack broad
generalization capabilities and require category-specific training data. To
address these limitations, we propose a fundamentally new approach to shape
correspondence based on denoising diffusion models. In our method, a diffusion
model learns to directly predict the functional map, a low-dimensional
representation of a point-wise map between shapes. We use a large dataset of
synthetic human meshes for training and employ two steps to reduce the number
of functional maps that need to be learned. First, the maps refer to a template
rather than shape pairs. Second, the functional map is defined in a basis of
eigenvectors of the Laplacian, which is not unique due to sign ambiguity.
Therefore, we introduce an unsupervised approach to select a specific basis by
correcting the signs of eigenvectors based on surface features. Our model
achieves competitive performance on standard human datasets, meshes with
anisotropic connectivity, non-isometric humanoid shapes, as well as animals
compared to existing descriptor-based and large-scale shape deformation
methods. See our project page for the source code and the datasets.
comment: CVPR 2025; Project page:
https://alekseizhuravlev.github.io/denoising-functional-maps/
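The sign-correction step can be illustrated with a simple rule: flip each eigenvector whose inner product with a reference surface feature is negative. The choice of feature and the dot-product rule are illustrative assumptions standing in for the paper's unsupervised basis selection:

```python
import numpy as np

def correct_eigvec_signs(eigvecs, feature):
    """Resolve the +/- sign ambiguity of a (Laplacian) eigenbasis by
    flipping each eigenvector so its inner product with a reference
    surface feature is non-negative."""
    signs = np.sign(eigvecs.T @ feature)   # one sign per eigenvector
    signs[signs == 0] = 1.0                # leave orthogonal vectors as-is
    return eigvecs * signs

rng = np.random.default_rng(0)
V = rng.standard_normal((10, 3))                       # toy eigenbasis
f = -2.0 * V[:, 0] + 0.1 * rng.standard_normal(10)     # anti-aligned feature
Vc = correct_eigvec_signs(V, f)
print(np.sign(Vc.T @ f))   # all entries non-negative after correction
```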
♻ ☆ CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation CVPR 2025
Interleaved image-text generation has emerged as a crucial multimodal task,
aiming at creating sequences of interleaved visual and textual content given a
query. Despite notable advancements in recent multimodal large language models
(MLLMs), generating integrated image-text sequences that exhibit narrative
coherence and entity and style consistency remains challenging due to poor
training data quality. To address this gap, we introduce CoMM, a high-quality
Coherent interleaved image-text MultiModal dataset designed to enhance the
coherence, consistency, and alignment of generated multimodal content.
Initially, CoMM harnesses raw data from diverse sources, focusing on
instructional content and visual storytelling, establishing a foundation for
coherent and consistent content. To further refine the data quality, we devise
a multi-perspective filter strategy that leverages advanced pre-trained models
to ensure the development of sentences, consistency of inserted images, and
semantic alignment between them. Various quality evaluation metrics are
designed to prove the high quality of the filtered dataset. Meanwhile,
extensive few-shot experiments on various downstream tasks demonstrate CoMM's
effectiveness in significantly enhancing the in-context learning capabilities
of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved
generation abilities, supported by a comprehensive evaluation framework. We
believe CoMM opens a new avenue for advanced MLLMs with superior multimodal
in-context learning and understanding ability.
comment: 22 pages, Accepted by CVPR 2025
♻ ☆ DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation
Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a
training-free paradigm that can make use of adaptive temporal compression in
latent space. While existing video generative models apply fixed compression
rates via a pretrained VAE, we observe that real-world video content exhibits
substantial temporal non-uniformity, with high-motion segments containing more
information than static scenes. Based on this insight, DLFR-VAE dynamically
adjusts the latent frame rate according to the content complexity.
Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent
Frame Rate Scheduler that partitions videos into temporal chunks and adaptively
determines optimal frame rates based on information-theoretic content
complexity, and (2) A training-free adaptation mechanism that transforms
pretrained VAE architectures into a dynamic VAE that can process features with
variable frame rates. Our simple but effective DLFR-VAE can function as a
plug-and-play module, seamlessly integrating with existing video generation
models and accelerating the video generation process.
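A hypothetical sketch of such a dynamic latent frame-rate scheduler, using mean absolute frame difference as a stand-in for the paper's information-theoretic content-complexity measure:

```python
import numpy as np

def schedule_frame_rates(video, chunk_len=8, low_fps=1, high_fps=4, thresh=0.1):
    """Split frames into temporal chunks and assign a higher latent
    frame rate to high-motion chunks. The motion proxy and threshold
    rule are illustrative assumptions."""
    rates = []
    for start in range(0, video.shape[0], chunk_len):
        chunk = video[start:start + chunk_len]
        if len(chunk) < 2:
            rates.append(low_fps)
            continue
        motion = np.abs(np.diff(chunk, axis=0)).mean()  # crude complexity proxy
        rates.append(high_fps if motion > thresh else low_fps)
    return rates

rng = np.random.default_rng(0)
static = np.zeros((8, 16, 16))        # static scene: little new information
dynamic = rng.random((8, 16, 16))     # high-motion segment
video = np.concatenate([static, dynamic])
print(schedule_frame_rates(video))    # low rate for static, high for dynamic
```

The per-chunk rates would then drive how aggressively each chunk is temporally compressed in latent space.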
♻ ☆ Why Autonomous Vehicles Are Not Ready Yet: A Multi-Disciplinary Review of Problems, Attempted Solutions, and Future Directions
Xingshuai Dong, Max Cappuccio, Hamad Al Jassmi, Fady Alnajjar, Essam Debie, Milad Ghasrikhouzani, Alessandro Lanteri, Ali Luqman, Tate McGregor, Oleksandra Molloy, Alice Plebe, Michael Regan, Dongmo Zhang
Personal autonomous vehicles are cars, trucks and bikes capable of sensing
their surrounding environment, planning their route, and driving with little or
no involvement of human drivers. Despite the impressive technological
achievements made by the industry in recent times and the hopeful announcements
made by leading entrepreneurs, to date no personal vehicle is approved for road
circulation in a 'fully' or 'semi' autonomous mode (autonomy levels 4 and 5)
and it is still unclear when such vehicles will eventually be mature enough to
receive this kind of approval. The present review adopts an integrative and
multidisciplinary approach to investigate the major challenges faced by the
automotive sector, with the aim of identifying the problems that still trouble and
delay the commercialization of autonomous vehicles. The review examines the
limitations and risks associated with current technologies and the most
promising solutions devised by the researchers. This negative assessment
methodology is not motivated by pessimism, but by the aspiration to raise
critical awareness about the technology's state-of-the-art, the industry's
quality standards, and the society's demands and expectations. While the survey
primarily focuses on the applications of artificial intelligence for perception
and navigation, it also aims to offer an enlarged picture that links the purely
technological aspects with the relevant human-centric aspects, including
cultural attitudes, conceptual assumptions, and normative (ethico-legal)
frameworks. Examining the broader context serves to highlight problems that
have a cross-disciplinary scope and identify solutions that may benefit from a
holistic consideration.
comment: This manuscript extends the work "Applications of Computer Vision in
Autonomous Vehicles: Methods, Challenges, and Future Directions." We have
added several sections to explore autonomous vehicles from a
multidisciplinary perspective. We propose changing the arXiv category to
cs.RO, as the expanded content addresses broader autonomous vehicle topics
aligning more closely with the Robotics domain
♻ ☆ Towards Physically Plausible Video Generation via VLM Planning
Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
Video diffusion models (VDMs) have advanced significantly in recent years,
enabling the generation of highly realistic videos and drawing the attention of
the community in their potential as world simulators. However, despite their
capabilities, VDMs often fail to produce physically plausible videos due to an
inherent lack of understanding of physics, resulting in incorrect dynamics and
event sequences. To address this limitation, we propose a novel two-stage
image-to-video generation framework that explicitly incorporates physics. In
the first stage, we employ a Vision Language Model (VLM) as a coarse-grained
motion planner, integrating chain-of-thought and physics-aware reasoning to
predict rough motion trajectories/changes that approximate real-world
physical dynamics while ensuring inter-frame consistency. In the second
stage, we use the predicted motion trajectories/changes to guide the video
generation of a VDM. As the predicted motion trajectories/changes are rough,
noise is added during inference to provide freedom to the VDM in generating
motion with more fine details. Extensive experimental results demonstrate that
our framework can produce physically plausible motion, and comparative
evaluations highlight the notable superiority of our approach over existing
methods. More video results are available on our Project Page:
https://madaoer.github.io/projects/physically_plausible_video_generation.
comment: 18 pages, 11 figures
♻ ☆ FriendNet: Detection-Friendly Dehazing Network
Adverse weather conditions often impair the quality of captured images,
inevitably degrading cutting-edge object detection models for advanced driver
assistance systems (ADAS) and autonomous driving. In this paper, we raise an
intriguing question: can the combination of image restoration and object
detection enhance detection performance in adverse weather conditions? To
answer it, we propose an effective architecture that bridges image dehazing and
object detection together via guidance information and task-driven learning to
achieve detection-friendly dehazing, termed FriendNet. FriendNet aims to
deliver both high-quality perception and high detection capacity. Different
from existing efforts that intuitively treat image dehazing as pre-processing,
FriendNet establishes a positive correlation between these two tasks. Clean
features generated by the dehazing network potentially contribute to
improvements in object detection performance. Conversely, object detection
crucially guides the learning process of the image dehazing network under the
task-driven learning scheme. We shed light on how downstream tasks can guide
upstream dehazing processes, considering both network architecture and learning
objectives. We design Guidance Fusion Block (GFB) and Guidance Attention Block
(GAB) to facilitate the integration of detection information into the network.
Furthermore, the incorporation of the detection task loss aids in refining the
optimization process. Additionally, we introduce a new Physics-aware Feature
Enhancement Block (PFEB), which integrates physics-based priors to enhance the
feature extraction and representation capabilities. Extensive experiments on
synthetic and real-world datasets demonstrate the superiority of our method
over state-of-the-art methods on both image quality and detection precision.
Our source code is available at https://github.com/fanyihua0309/FriendNet.
comment: We identified a fundamental flaw in the theoretical framework of this
submission, rendering the main argument unsound. To maintain academic rigor,
we request withdrawal and will submit a revised version after thorough
validation
♻ ☆ Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection CVPR 2025
Recent studies highlighted a practical setting of unsupervised anomaly
detection (UAD) that builds a unified model for multi-class images. Despite
various advancements addressing this challenging task, the detection
performance under the multi-class setting still lags far behind
state-of-the-art class-separated models. Our research aims to bridge this
substantial performance gap. In this paper, we introduce a minimalistic
reconstruction-based anomaly detection framework, namely Dinomaly, which
leverages pure Transformer architectures without relying on complex designs,
additional modules, or specialized tricks. Given this powerful framework
consisting of only attention layers and MLPs, we found four simple components
that are essential to multi-class anomaly detection: (1) Foundation
Transformers that extract universal and discriminative features, (2) Noisy
Bottleneck where
pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention
that naturally cannot focus, and (4) Loose Reconstruction that does not force
layer-to-layer and point-by-point reconstruction. Extensive experiments are
conducted across popular anomaly detection benchmarks including MVTec-AD, VisA,
and Real-IAD. Our proposed Dinomaly achieves impressive image-level AUROC of
99.6%, 98.7%, and 89.3% on the three datasets respectively, which is not only
superior to state-of-the-art multi-class UAD methods, but also achieves the
most advanced class-separated UAD records.
comment: IEEE/CVF CVPR 2025
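The "naturally unfocused" component, linear attention, can be sketched as follows; the elu(x)+1 feature map is one common illustrative choice, not necessarily Dinomaly's:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Softmax-free linear attention: a positive feature map replaces
    softmax, so the output can be computed through a (d x d_v)
    key-value summary in O(N d^2) rather than an N x N attention map.
    Without softmax's sharpening, attention 'naturally cannot focus'."""
    def phi(x):                        # elu(x) + 1, strictly positive
        return np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                      # compact key-value summary
    z = Qf @ Kf.sum(axis=0)            # per-query normalizer
    return (Qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.random((6, 4)), rng.random((6, 4)), rng.random((6, 5))
out = linear_attention(Q, K, V)
print(out.shape)
```

The result is numerically identical to explicitly normalizing `phi(Q) @ phi(K).T` row-wise, just computed without ever forming that matrix.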
♻ ☆ Scale-adaptive UAV Geo-localization via Height-aware Partition Learning
UAV Geo-Localization faces significant challenges due to the drastic
appearance discrepancy between drone-captured images and satellite views.
Existing methods typically assume a consistent scaling factor across views and
rely on predefined partition alignment to extract viewpoint-invariant
representations through part-level feature construction. However, this scaling
assumption often fails in real-world scenarios, where variations in drone
flight states lead to scale mismatches between cross-view images, resulting in
severe performance degradation. To address this issue, we propose a
scale-adaptive partition learning framework that leverages known drone flight
height to predict scale factors and dynamically adjust feature extraction. Our
key contribution is a height-aware adjustment strategy, which calculates the
relative height ratio between drone and satellite views, dynamically adjusting
partition sizes to explicitly align semantic information between partition
pairs. This strategy is integrated into a Scale-adaptive Local Partition
Network (SaLPN), building upon an existing square partition strategy to extract
both fine-grained and global features. Additionally, we propose a saliency-guided
refinement strategy to enhance part-level features, further improving retrieval
accuracy. Extensive experiments validate that our height-aware, scale-adaptive
approach achieves state-of-the-art geo-localization accuracy in various
scale-inconsistent scenarios and exhibits strong robustness against scale
variations. The code will be made publicly available.
comment: In Peer Review
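The height-aware adjustment reduces to scaling partition sizes by a relative height ratio so that cross-view partition pairs cover matching semantic content; the function name and the linear scaling rule below are illustrative assumptions:

```python
def height_aware_partition(base_size, drone_height, reference_height):
    """Scale a square partition's side length by the relative height
    ratio between the drone view and a reference (satellite-aligned)
    height, so partitions at different flight heights stay semantically
    aligned. A sketch of the idea, not SaLPN's exact rule."""
    ratio = drone_height / reference_height
    return max(1, round(base_size * ratio))

# A lower flight altitude (smaller ratio) shrinks the partition,
# compensating for the larger apparent scale of ground objects.
print(height_aware_partition(base_size=64, drone_height=100, reference_height=400))
```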
♻ ☆ SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model
Speech enhancement plays an essential role in various applications, and the
integration of visual information has been demonstrated to bring substantial
advantages. However, the majority of current research concentrates on the
examination of facial and lip movements, which can be compromised or entirely
inaccessible in scenarios where occlusions occur or when the camera view is
distant. Whereas contextual visual cues from the surrounding environment have
been overlooked: for example, when we see a dog bark, our brain has the innate
ability to discern and filter out the barking noise. To this end, in this
paper, we introduce a novel task, i.e., SAV-SE. To the best of our knowledge, this is
the first proposal to use rich contextual information from synchronized video
as auxiliary cues to indicate the type of noise, which eventually improves the
speech enhancement performance. Specifically, we propose the VC-S$^2$E method,
which incorporates the Conformer and Mamba modules for their complementary
strengths. Extensive experiments are conducted on public MUSIC, AVSpeech and
AudioSet datasets, where the results demonstrate the superiority of VC-S$^2$E
over other competitive methods. We will make the source code publicly
available. Project demo page: https://AVSEPage.github.io/
comment: accepted by IEEE Journal of Selected Topics in Signal Processing
♻ ☆ Efficient 3D Recognition with Event-driven Spike Sparse Convolution AAAI 2025
Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D
spatio-temporal features. Point clouds are sparse 3D spatial data, which
suggests that SNNs should be well-suited for processing them. However, when
applying SNNs to point clouds, they often exhibit limited performance and
narrower applicability. We attribute this to inappropriate preprocessing and
feature extraction methods. To address this issue, we first introduce the Spike
Voxel Coding (SVC) scheme, which encodes the 3D point clouds into a sparse
spike train space, reducing the storage requirements and saving time on point
cloud preprocessing. Then, we propose a Spike Sparse Convolution (SSC) model
for efficiently extracting 3D sparse point cloud features. Combining SVC and
SSC, we design an efficient 3D SNN backbone (E-3DSNN), which is friendly with
neuromorphic hardware. For instance, SSC can be implemented on neuromorphic
chips with only minor modifications to the addressing function of vanilla spike
convolution. Experiments on ModelNet40, KITTI, and Semantic KITTI datasets
demonstrate that E-3DSNN achieves state-of-the-art (SOTA) results with
remarkable efficiency. Notably, our E-3DSNN (1.87M) obtained 91.7% top-1
accuracy on ModelNet40, surpassing the current best SNN baseline (14.3M) by
3.0%. To the best of our knowledge, it is the first directly trained 3D SNN backbone
that can simultaneously handle various 3D computer vision tasks (e.g.,
classification, detection, and segmentation) with an event-driven nature. Code
is available: https://github.com/bollossom/E-3DSNN/.
comment: Accepted by AAAI 2025
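The Spike Voxel Coding idea, quantizing the point cloud into voxels and emitting a sparse binary spike train, can be sketched as follows; the count-threshold firing rule is an illustrative assumption, not the paper's exact encoder:

```python
import numpy as np

def spike_voxel_coding(points, voxel_size=1.0, threshold=2):
    """Quantize a point cloud into voxels and emit a binary spike for
    voxels whose point count reaches a firing threshold, yielding a
    sparse spike representation suited to event-driven processing."""
    voxels = np.floor(points / voxel_size).astype(int)
    uniq, counts = np.unique(voxels, axis=0, return_counts=True)
    return uniq[counts >= threshold]       # only 'firing' voxels survive

pts = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6],    # lands in the same voxel as the first
                [3.0, 3.0, 3.0]])   # isolated point, below the threshold
print(spike_voxel_coding(pts, voxel_size=1.0, threshold=2))
```

The sparse spike coordinates are what a spike sparse convolution would then consume, skipping all empty voxels.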
♻ ☆ Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
We study the task of language-conditioned pick and place in clutter, where a
robot should grasp a target object in open clutter and move it to a specified
place. Some approaches learn end-to-end policies with features from vision
foundation models, requiring large datasets. Others combine foundation models
in a zero-shot setting, suffering from cascading errors. In addition, they
primarily leverage vision and language foundation models, focusing less on
action priors. In this paper, we aim to develop an effective policy by
integrating foundation priors from vision, language, and action. We propose
A$^2$, an action prior alignment method that aligns unconditioned action priors
with 3D vision-language priors by learning one attention layer. The alignment
formulation enables our policy to train with less data and preserve zero-shot
generalization capabilities. We show that a shared policy for both pick and
place actions enhances the performance for each task, and introduce a policy
adaptation scheme to accommodate the multi-modal nature of actions. Extensive
experiments in simulation and the real-world show that our policy achieves
higher task success rates with fewer steps for both pick and place tasks in
clutter, effectively generalizing to unseen objects and language instructions.
Videos and code are available at https://xukechun.github.io/papers/A2.
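Aligning unconditioned action priors with vision-language priors through one learned attention layer can be sketched as a single cross-attention block; the weight names, shapes, and single-head form are illustrative assumptions, not A$^2$'s exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_action_prior(action_feats, vl_feats, Wq, Wk, Wv):
    """One cross-attention layer: action-prior features (queries)
    attend to 3D vision-language features (keys/values), grounding the
    unconditioned action prior in the scene's VL representation."""
    Q = action_feats @ Wq
    K = vl_feats @ Wk
    V = vl_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V

rng = np.random.default_rng(0)
out = align_action_prior(rng.random((4, 8)),   # 4 action-prior tokens
                         rng.random((16, 8)),  # 16 vision-language tokens
                         rng.random((8, 8)), rng.random((8, 8)), rng.random((8, 8)))
print(out.shape)
```

Because only this one layer is learned, the frozen priors keep their zero-shot behavior, which is the data-efficiency argument the abstract makes.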
♻ ☆ FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning
task requiring intelligent systems to answer natural language queries based on
paired audio-video inputs accurately. However, existing AVQA approaches often
suffer from overfitting to dataset biases, leading to poor robustness.
Moreover, current datasets may not effectively diagnose these methods. To
address these challenges, we first introduce a novel dataset, FortisAVQA,
constructed in two stages: (1) rephrasing questions in the test split of the
public MUSIC-AVQA dataset and (2) introducing distribution shifts across
questions. The first stage expands the test space with greater diversity, while
the second enables a refined robustness evaluation across rare, frequent, and
overall question distributions. Second, we introduce a robust Multimodal
Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle
collaborative debiasing strategy to mitigate bias learning. Experimental
results demonstrate that our architecture achieves state-of-the-art performance
on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies
on both datasets validate the effectiveness of our debiasing components.
Additionally, our evaluation reveals the limited robustness of existing
multimodal QA methods. We also verify the plug-and-play capability of our
strategy by integrating it with various baseline models across both datasets.
Our dataset and code are available at https://github.com/reml-group/fortisavqa.
comment: Under Review
♻ ☆ Underwater Camouflaged Object Tracking Meets Vision-Language SAM2
Over the past decade, significant progress has been made in visual object
tracking, largely due to the availability of large-scale datasets. However,
these datasets have primarily focused on open-air scenarios and have largely
overlooked underwater animal tracking-especially the complex challenges posed
by camouflaged marine animals. To bridge this gap, we take a step forward by
proposing the first large-scale multi-modal underwater camouflaged object
tracking dataset, namely UW-COT220. Based on the proposed dataset, this work
first comprehensively evaluates current advanced visual object tracking
methods, including SAM- and SAM2-based trackers, in challenging underwater
environments, e.g., coral reefs. Our findings highlight the improvements of SAM2
over SAM, demonstrating its enhanced ability to handle the complexities of
underwater camouflaged objects. Furthermore, we propose a novel vision-language
tracking framework called VL-SAM2, based on the video foundation model SAM2.
Experimental results demonstrate that our VL-SAM2 achieves state-of-the-art
performance on the UW-COT220 dataset. The dataset and codes are available
at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.
comment: Preprint.
https://github.com/983632847/Awesome-Multimodal-Object-Tracking
♻ ☆ Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving
Perceiving the environment and its changes over time corresponds to two
fundamental yet heterogeneous types of information: semantics and motion.
Previous end-to-end autonomous driving works represent both types of
information in a single feature vector. However, including motion-related
tasks, such as prediction and planning, impairs detection and tracking
performance, a phenomenon known as negative transfer in multi-task learning. To
address this issue, we propose Neural-Bayes motion decoding, a novel parallel
detection, tracking, and prediction method that separates semantic and motion
learning. Specifically, we employ a set of learned motion queries that operate
in parallel with detection and tracking queries, sharing a unified set of
recursively updated reference points. Moreover, we employ interactive semantic
decoding to enhance information exchange in semantic tasks, promoting positive
transfer. Experiments on the nuScenes dataset with UniAD and SparseDrive
confirm the effectiveness of our divide and merge approach, resulting in
performance improvements across perception, prediction, and planning. Our code
is available at https://github.com/shenyinzhe/DMAD.
♻ ☆ Autonomous AI for Multi-Pathology Detection in Chest X-Rays: A Multi-Site Study in the Indian Healthcare System
Bargava Subramanian, Shajeev Jaikumar, Praveen Shastry, Naveen Kumarasami, Kalyan Sivasailam, Anandakumar D, Keerthana R, Mounigasri M, Kishore Prasath Venkatesh
Study Design: The study outlines the development of an autonomous AI system
for chest X-ray (CXR) interpretation, trained on a vast dataset of over 5
million X-rays sourced from healthcare systems across India. This AI system
integrates advanced architectures including Vision Transformers, Faster R-CNN,
and various U-Net models (such as Attention U-Net, U-Net++, and Dense U-Net) to
enable comprehensive classification, detection, and segmentation of 75 distinct
pathologies. To ensure robustness, the study design includes subgroup analyses
across age, gender, and equipment type, validating the model's adaptability and
performance across diverse patient demographics and imaging environments.
Performance: The AI system achieved up to 98% precision and over 95% recall
for multi-pathology classification, with stable performance across demographic
and equipment subgroups. For normal vs. abnormal classification, it reached
99.8% precision, 99.6% recall, and 99.9% negative predictive value (NPV). It
was deployed in 17 major healthcare systems in India including diagnostic
centers, large hospitals, and government hospitals. Over the deployment period,
the system processed over 150,000 scans, averaging 2,000 chest X-rays daily,
resulting in reduced reporting times and improved diagnostic accuracy.
Conclusion: The high precision and recall validate the AI's capability as a
reliable tool for autonomous normal/abnormal classification, pathology
localization, and segmentation. This scalable AI model addresses diagnostic
gaps in underserved areas, optimizing radiology workflows and enhancing patient
care across diverse healthcare settings in India.
comment: 27 pages, 8 figures
♻ ☆ Muographic Image Upsampling with Machine Learning for Built Infrastructure Applications
The civil engineering industry faces a critical need for innovative
non-destructive evaluation methods, particularly for ageing critical
infrastructure, such as bridges, where current techniques fall short.
Muography, a non-invasive imaging technique, constructs three-dimensional
density maps by detecting interactions of naturally occurring cosmic-ray muons
within the scanned volume. Cosmic-ray muons provide deep penetration and
inherent safety due to their high momenta and natural source. However, the
technology's reliance on this source results in constrained muon flux, leading
to prolonged acquisition times, noisy reconstructions and image interpretation
challenges. To address these limitations, we developed a two-model deep
learning approach. First, we employed a conditional Wasserstein generative
adversarial network with gradient penalty (cWGAN-GP) to perform predictive
upsampling of undersampled muography images. Using the Structural Similarity
Index Measure (SSIM), 1-day sampled images matched the perceptual qualities of
a 21-day image, while the Peak Signal-to-Noise Ratio (PSNR) indicated noise
improvement equivalent to 31 days of sampling. A second cWGAN-GP model, trained
for semantic segmentation, quantitatively assessed the upsampling model's
impact on concrete sample features. This model achieved segmentation of rebar
grids and tendon ducts, with Dice-Sørensen accuracy coefficients of 0.8174
and 0.8663. Notably, it could mitigate or remove z-plane smearing artifacts
caused by muography's inverse imaging problem. Both models were trained on a
comprehensive Geant4 Monte-Carlo simulation dataset reflecting realistic civil
infrastructure scenarios. Our results demonstrate significant improvements in
acquisition speed and image quality, marking a substantial step toward making
muography more practical for reinforced concrete infrastructure monitoring
applications.
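The abstract's PSNR claim (noise improvement equivalent to 31 days of sampling) rests on the standard peak signal-to-noise ratio formula. As a reminder of the metric, here is a minimal numpy sketch on hypothetical arrays, not the paper's data:

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, data_range: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio (dB) between two images of equal shape."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

# Sanity check: heavier noise should lower the PSNR.
rng = np.random.default_rng(0)
clean = rng.random((64, 64))
low_noise = clean + 0.01 * rng.standard_normal((64, 64))
high_noise = clean + 0.10 * rng.standard_normal((64, 64))
assert psnr(clean, low_noise) > psnr(clean, high_noise)
```

Longer muon acquisition lowers the MSE term, which is what lets a 1-day upsampled image match the PSNR of a many-day exposure.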
♻ ☆ Pairwise-Constrained Implicit Functions for 3D Human Heart Modelling
Accurate 3D models of the human heart require not only correct outer surfaces
but also realistic inner structures, such as the ventricles, atria, and
myocardial layers. Approaches relying on implicit surfaces, such as signed
distance functions (SDFs), are primarily designed for single watertight
surfaces, making them ill-suited for multi-layered anatomical structures. They
often produce gaps or overlaps in shared boundaries. Unsigned distance
functions (UDFs) can model non-watertight geometries but are harder to
optimize, while voxel-based methods are limited in resolution and struggle to
produce smooth, anatomically realistic surfaces. We introduce a
pairwise-constrained SDF approach that models the heart as a set of
interdependent SDFs, each representing a distinct anatomical component. By
enforcing proper contact between adjacent SDFs, we ensure that they form
anatomically correct shared walls, preserving the internal structure of the
heart and preventing overlaps or unwanted gaps. Our method significantly
improves inner structure accuracy over single-SDF, UDF-based, voxel-based, and
segmentation-based reconstructions. We further demonstrate its generalizability
by applying it to a vertebrae dataset, preventing unwanted contact between
structures.
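The pairwise constraint described above can be pictured as a two-term penalty: adjacent SDFs should both vanish on samples of their shared wall, and their interiors (negative values) must not interpenetrate. A minimal numpy sketch of such a penalty, with hypothetical names and a 1D toy example; the paper's actual loss may differ:

```python
import numpy as np

def pairwise_contact_loss(f1, f2, wall_mask, w_wall=1.0, w_overlap=1.0):
    """Illustrative contact penalty between two SDFs sampled at shared points.
    - wall term: both zero level sets should pass through shared-wall samples
    - overlap term: interiors (f < 0) of the two components must not intersect
    """
    wall = np.mean(f1[wall_mask] ** 2 + f2[wall_mask] ** 2)
    overlap = np.mean(np.maximum(-f1, 0.0) * np.maximum(-f2, 0.0))
    return w_wall * wall + w_overlap * overlap

# 1D toy: two intervals [-1, 0] and [0, 1] sharing a wall at x = 0.
x = np.linspace(-1.5, 1.5, 301)
f_left = np.maximum(-1.0 - x, x - 0.0)   # SDF of [-1, 0]
f_right = np.maximum(0.0 - x, x - 1.0)   # SDF of [0, 1]: proper contact
f_bad = np.maximum(-0.2 - x, x - 1.0)    # shifted copy: interpenetrates
wall = np.isclose(x, 0.0)
assert pairwise_contact_loss(f_left, f_right, wall) < \
       pairwise_contact_loss(f_left, f_bad, wall)
```

Minimizing such a term during fitting is one way the shared myocardial walls can be kept gap-free without merging the components into a single watertight surface.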
♻ ☆ AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities
Geospatial models must adapt to the diversity of Earth observation data in
terms of resolutions, scales, and modalities. However, existing approaches
expect fixed input configurations, which limits their practical applicability.
We propose AnySat, a multimodal model based on joint embedding predictive
architecture (JEPA) and scale-adaptive spatial encoders, allowing us to train a
single model on highly heterogeneous data in a self-supervised manner. To
demonstrate the advantages of this unified approach, we compile GeoPlex, a
collection of $5$ multimodal datasets with varying characteristics and $11$
distinct sensors. We then train a single powerful model on these diverse
datasets simultaneously. Once fine-tuned or probed, we reach state-of-the-art
results on the test sets of GeoPlex and for $6$ external datasets across
various environment monitoring tasks: land cover mapping, tree species
identification, crop type classification, change detection, climate type
classification, and segmentation of flood, burn scar, and deforestation. The
code and models are available at https://github.com/gastruc/AnySat.
♻ ☆ Sparse Dictionary Learning for Image Recovery by Iterative Shrinkage
In this paper we study the sparse coding problem in the context of sparse
dictionary learning for image recovery. To this end, we consider and compare
several state-of-the-art sparse optimization methods constructed using the
shrinkage operation. As the mathematical setting of these methods, we adopt an
online approach as the algorithmic basis, together with the basis pursuit
denoising problem that arises from the convex optimization formulation of the
dictionary learning problem.
By a dedicated construction of datasets and corresponding dictionaries, we
study the effect of enlarging the underlying learning database on
reconstruction quality, using several error measures. Our study shows that the
choice of optimization method can be practically important depending on the
availability of training data. Across the different training-data settings
considered in our study, we also assess the computational efficiency of the
evaluated optimization methods.
comment: 19 pages, 5 Figures, IntelliSys 2025
♻ ☆ SAM-REF: Introducing Image-Prompt Synergy during Interaction for Detail Enhancement in the Segment Anything Model
Interactive segmentation aims to segment the mask of a target object
according to the user's interactive prompts. There are two mainstream
strategies: early fusion and late fusion. Current specialist models utilize the
early fusion strategy that encodes the combination of images and prompts to
target the prompted objects, yet repetitive complex computations on the images
result in high latency. Late fusion models extract image embeddings once and
merge them with the prompts in later interactions. This strategy avoids
redundant image feature extraction and improves efficiency significantly. A
recent milestone is the Segment Anything Model (SAM). However, this strategy
limits the models' ability to extract detailed information from the prompted
target zone. To address this issue, we propose SAM-REF, a two-stage refinement
framework that fully integrates images and prompts by introducing a lightweight
refiner into the late-fusion interaction, combining the accuracy of early
fusion with the efficiency of late fusion. Through extensive
experiments, we show that our SAM-REF model outperforms the current
state-of-the-art method in most metrics on segmentation quality without
compromising efficiency.
♻ ☆ Loong: Generating Minute-level Long Videos with Autoregressive Language Models
It is desirable but challenging to generate content-rich long videos on the
scale of minutes. Autoregressive large language models (LLMs) have achieved
great success in generating coherent and long sequences of tokens in the domain
of natural language processing, while the exploration of autoregressive LLMs
for video generation is limited to generating short videos of several seconds.
In this work, we conduct a deep analysis of the challenges that prevent
autoregressive LLM-based video generators from generating long videos. Based on
the observations and analysis, we propose Loong, a new autoregressive LLM-based
video generator that can generate minute-long videos. Specifically, we model
the text tokens and video tokens as a unified sequence for autoregressive LLMs
and train the model from scratch. We propose progressive short-to-long training
with a loss re-weighting scheme to mitigate the loss imbalance problem for long
video training. We further investigate inference strategies, including video
token re-encoding and sampling strategies, to diminish error accumulation
during inference. Our proposed Loong can be trained on 10-second videos and be
extended to generate minute-level long videos conditioned on text prompts, as
demonstrated by the results. More samples are available at:
https://yuqingwang1029.github.io/Loong-video.
comment: Project page: https://yuqingwang1029.github.io/Loong-video
♻ ☆ Dynamic Proxy Domain Generalizes the Crowd Localization by Better Binary Segmentation
Crowd localization aims to predict the precise location of each instance
within an image. Current advanced methods adopt pixel-wise binary
classification to tackle congested prediction, in which pixel-level thresholds
binarize the confidence that a pixel belongs to a pedestrian head. Since crowd
scenes vary greatly in content, count, and scale, the confidence-threshold
learner is fragile and generalizes poorly under domain shift. Moreover, the
target domain is usually unknown during training. Hence, it is imperative to
enhance the generalization of the confidence-threshold locator to the latent
target domain. In this paper, we propose a Dynamic Proxy Domain (DPD) method to
generalize the learner under domain shift. Concretely, based on the theoretical
analysis to the generalization error risk upper bound on the latent target
domain to a binary classifier, we propose to introduce a generated proxy domain
to facilitate generalization. Then, based on the theory, we design a DPD
algorithm composed of a training paradigm and a proxy domain generator
to enhance the domain generalization of the confidence-threshold learner.
Besides, we evaluate our method on five kinds of domain shift scenarios,
demonstrating its effectiveness in generalizing crowd localization. Our
code will be available at https://github.com/zhangda1018/DPD.
♻ ☆ Parallelized Autoregressive Visual Generation CVPR 2025
Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu
Autoregressive models have emerged as a powerful approach for visual
generation but suffer from slow inference speed due to their sequential
token-by-token prediction process. In this paper, we propose a simple yet
effective approach for parallelized autoregressive visual generation that
improves generation efficiency while preserving the advantages of
autoregressive modeling. Our key insight is that parallel generation depends on
visual token dependencies-tokens with weak dependencies can be generated in
parallel, while strongly dependent adjacent tokens are difficult to generate
together, as their independent sampling may lead to inconsistencies. Based on
this observation, we develop a parallel generation strategy that generates
distant tokens with weak dependencies in parallel while maintaining sequential
generation for strongly dependent local tokens. Our approach can be seamlessly
integrated into standard autoregressive models without modifying the
architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that
our method achieves a 3.6x speedup with comparable quality and up to 9.5x
speedup with minimal quality degradation across both image and video generation
tasks. We hope this work will inspire future research in efficient visual
generation and unified autoregressive modeling. Project page:
https://yuqingwang1029.github.io/PAR-project.
comment: CVPR 2025 Accepted - Project Page:
https://yuqingwang1029.github.io/PAR-project
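The scheduling idea above ("distant tokens in parallel, neighbors sequentially") can be illustrated on a token grid: split it into regions and emit, at each step, the tokens sitting at the same within-region offset, one per region. This is only a toy sketch of the grouping, and it omits details such as the sequential initial tokens the authors describe:

```python
def parallel_schedule(h, w, rh, rw):
    """Toy PAR-style schedule for an h x w token grid split into rh x rw
    regions. Each group holds one token per region (same within-region
    offset), so tokens generated together are spatially distant and weakly
    dependent; the groups themselves are emitted sequentially."""
    bh, bw = h // rh, w // rw  # region height/width
    groups = []
    for i in range(bh):
        for j in range(bw):
            groups.append([(r * bh + i, c * bw + j)
                           for r in range(rh) for c in range(rw)])
    return groups

groups = parallel_schedule(4, 4, 2, 2)
assert len(groups) == 4  # 16 tokens emitted in 4 sequential steps: 4x speedup
assert sorted(p for g in groups for p in g) == \
       [(i, j) for i in range(4) for j in range(4)]  # every position covered
```

With rh x rw regions the sequential step count drops by a factor of rh*rw, which is the source of the speedups reported in the abstract.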
♻ ☆ Adapting Video Diffusion Models for Time-Lapse Microscopy
We present a domain adaptation of video diffusion models to generate highly
realistic time-lapse microscopy videos of cell division in HeLa cells. Although
state-of-the-art generative video models have advanced significantly for
natural videos, they remain underexplored in microscopy domains. To address
this gap, we fine-tune a pretrained video diffusion model on
microscopy-specific sequences, exploring three conditioning strategies: (1)
text prompts derived from numeric phenotypic measurements (e.g., proliferation
rates, migration speeds, cell-death frequencies), (2) direct numeric embeddings
of phenotype scores, and (3) image-conditioned generation, where an initial
microscopy frame is extended into a complete video sequence. Evaluation using
biologically meaningful morphological, proliferation, and migration metrics
demonstrates that fine-tuning substantially improves realism and accurately
captures critical cellular behaviors such as mitosis and migration. Notably,
the fine-tuned model also generalizes beyond the training horizon, generating
coherent cell dynamics even in extended sequences. However, precisely
controlling specific phenotypic characteristics remains challenging,
highlighting opportunities for future work to enhance conditioning methods. Our
results demonstrate the potential for domain-specific fine-tuning of generative
video models to produce biologically plausible synthetic microscopy data,
supporting applications such as in-silico hypothesis testing and data
augmentation.
♻ ☆ Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
DeepSeek series have demonstrated outstanding performance in general scene
understanding, question-answering (QA), and text generation tasks, owing to its
efficient training paradigm and strong reasoning capabilities. In this study,
we investigate the dialogue capabilities of the DeepSeek model in robotic
surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and
Detailed Description. The Single Phrase QA tasks further include sub-tasks such
as surgical instrument recognition, action understanding, and spatial position
analysis. We conduct extensive evaluations using publicly available datasets,
including EndoVis18 and CholecT50, along with their corresponding dialogue
data. Our comprehensive evaluation results indicate that, when provided with
specific prompts, DeepSeek-V3 performs well in surgical instrument and tissue
recognition tasks. However, DeepSeek-V3 exhibits significant limitations in
spatial position analysis and struggles to understand surgical actions
accurately. Additionally, our findings reveal that, under general prompts,
DeepSeek-V3 lacks the ability to effectively analyze global surgical concepts
and fails to provide detailed insights into surgical scenarios. Based on our
observations, we argue that DeepSeek-V3 is not ready for vision-language
tasks in surgical contexts without fine-tuning on surgery-specific datasets.
comment: Technical Report
♻ ☆ Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection ICLR 2025
In pursuit of detecting unconstrained objects that extend beyond predefined
categories, prior arts of open-vocabulary object detection (OVD) typically
resort to pretrained vision-language models (VLMs) for base-to-novel category
generalization. However, to mitigate the misalignment between upstream
image-text pretraining and downstream region-level perception, additional
supervision is indispensable, e.g., image-text pairs or pseudo annotations
generated via self-training strategies. In this work, we propose CCKT-Det
trained without any extra supervision. The proposed framework constructs a
cyclic and dynamic knowledge transfer from language queries and visual region
features extracted from VLMs, which forces the detector to closely align with
the visual-semantic space of VLMs. Specifically, 1) we prefilter and inject
semantic priors to guide the learning of queries, and 2) introduce a regional
contrastive loss to improve the awareness of queries on novel objects. CCKT-Det
can consistently improve performance as the scale of VLMs increases, all while
keeping the detector's computational overhead at a moderate level.
Comprehensive experimental results demonstrate that our method achieves
performance gain of +2.9% and +10.2% AP50 over previous state-of-the-arts on
the challenging COCO benchmark, both without and with a stronger teacher model.
comment: 10 pages, 5 figures, Published as a conference paper at ICLR 2025
♻ ☆ VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models
Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, Filippos Kokkinos
Large Vision-Language Models (LVLMs) struggle with puzzles, which require
precise perception, rule comprehension, and logical reasoning. Assessing and
enhancing their performance in this domain is crucial, as it reflects their
ability to engage in structured reasoning - an essential skill for real-world
problem-solving. However, existing benchmarks primarily evaluate pre-trained
models without additional training or fine-tuning, often lack a dedicated focus
on reasoning, and fail to establish a systematic evaluation framework. To
address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning
Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple
difficulty levels, and includes extensive experiments not only on existing chat
LVLMs (e.g., GPT-4o), but also on reasoning LVLMs (e.g., Gemini-Thinking). Our
results reveal that even the state-of-the-art LVLMs struggle with these
puzzles, highlighting fundamental limitations in their puzzle-solving
capabilities. Most importantly, through systematic experiments, we identify and
analyze key factors influencing LVLMs' puzzle-solving performance, including
the number of clues, grid size, and rule complexity. Furthermore, we explore
two Supervised Fine-Tuning (SFT) strategies that can be used in post-training:
SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT).
While both methods significantly improve performance on trained puzzles, they
exhibit limited generalization to unseen ones. We will release VGRP-Bench to
facilitate further research on LVLMs for complex, real-world problem-solving.
Project page: https://yufan-ren.com/subpage/VGRP-Bench/.
comment: 8 pages; Project page: https://yufan-ren.com/subpage/VGRP-Bench/
♻ ☆ EmoHead: Emotional Talking Head via Manipulating Semantic Expression Parameters
Generating emotion-specific talking head videos from audio input is an
important and complex challenge for human-machine interaction. However, emotion
is a highly abstract concept with ambiguous boundaries, and it necessitates
disentangled expression parameters to generate emotionally expressive talking
head videos. In this work, we present EmoHead to synthesize talking head videos
via semantic expression parameters. To predict expression parameters for
arbitrary audio input, we apply an audio-expression module that can be
specified by an emotion tag. This module aims to enhance correlation from audio
input across various emotions. Furthermore, we leverage a pre-trained hyperplane
to refine facial movements by probing along the vertical direction. Finally,
the refined expression parameters regularize neural radiance fields and
facilitate the emotion-consistent generation of talking head videos.
Experimental results demonstrate that semantic expression parameters lead to
better reconstruction quality and controllability.
♻ ☆ Towards Calibrated Deep Clustering Network ICLR 2025
Deep clustering has exhibited remarkable performance; however, the
overconfidence problem, i.e., the estimated confidence for a sample belonging
to a particular cluster greatly exceeds its actual prediction accuracy, has
been overlooked in prior research. To tackle this critical issue, we pioneer
the development of a calibrated deep clustering framework. Specifically, we
propose a novel dual-head (calibration head and clustering head) deep
clustering model
that can effectively calibrate the estimated confidence and the actual
accuracy. The calibration head adjusts the overconfident predictions of the
clustering head, generating prediction confidence that matches the model
learning status. Then, the clustering head dynamically selects reliable
high-confidence samples estimated by the calibration head for pseudo-label
self-training. Additionally, we introduce an effective network initialization
strategy that enhances both training speed and network robustness. The
effectiveness of the proposed calibration approach and initialization strategy
are both endorsed with solid theoretical guarantees. Extensive experiments
demonstrate the proposed calibrated deep clustering model not only surpasses
the state-of-the-art deep clustering methods by 5x on average in terms of
expected calibration error, but also significantly outperforms them in terms of
clustering accuracy. The code is available at
https://github.com/ChengJianH/CDC.
comment: The paper is accepted by ICLR 2025
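Expected calibration error, the metric on which the 5x improvement above is reported, bins predictions by confidence and averages the per-bin gap between accuracy and mean confidence. A standard numpy sketch with hypothetical inputs, not the paper's code:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: weighted average over confidence bins of |accuracy - confidence|."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for k in range(n_bins):
        lo, hi = edges[k], edges[k + 1]
        mask = (confidence > lo) & (confidence <= hi)
        if k == 0:  # include samples at exactly 0.0 in the first bin
            mask |= confidence == 0.0
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return ece

# A calibrated model's confidence matches its accuracy; an overconfident
# clustering head (the problem the paper targets) does not.
calibrated = expected_calibration_error([0.55] * 100, [1] * 55 + [0] * 45)
overconfident = expected_calibration_error([0.99] * 100, [1] * 55 + [0] * 45)
assert calibrated < 0.01 < overconfident
```

The calibration head's job, in these terms, is to pull the clustering head's confidences back toward the diagonal of the reliability diagram.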
♻ ☆ ArchCAD-400K: An Open Large-Scale Architectural CAD Dataset and New Baseline for Panoptic Symbol Spotting
Ruifeng Luo, Zhengjie Liu, Tianxiao Cheng, Jie Wang, Tongjie Wang, Xingguang Wei, Haomin Wang, YanPeng Li, Fu Chai, Fei Cheng, Shenglong Ye, Wenhai Wang, Yanting Zhang, Yu Qiao, Hongjie Zhang, Xianzhong Zhao
Recognizing symbols in architectural CAD drawings is critical for various
advanced engineering applications. In this paper, we propose a novel CAD data
annotation engine that leverages intrinsic attributes from systematically
archived CAD drawings to automatically generate high-quality annotations, thus
significantly reducing manual labeling efforts. Utilizing this engine, we
construct ArchCAD-400K, a large-scale CAD dataset consisting of 413,062 chunks
from 5538 highly standardized drawings, making it over 26 times larger than the
largest existing CAD dataset. ArchCAD-400K boasts an extended drawing diversity
and broader categories, offering line-grained annotations. Furthermore, we
present a new baseline model for panoptic symbol spotting, termed Dual-Pathway
Symbol Spotter (DPSS). It incorporates an adaptive fusion module to enhance
primitive features with complementary image features, achieving
state-of-the-art performance and enhanced robustness. Extensive experiments
validate the effectiveness of DPSS, demonstrating the value of ArchCAD-400K and
its potential to drive innovation in architectural design and construction.
♻ ☆ Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering CVPR 2025
Multimodal LLMs (MLLMs) are the natural extension of large language models to
handle multimodal inputs, combining text and image data. They have recently
garnered attention due to their capability to address complex tasks involving
both modalities. However, their effectiveness is limited to the knowledge
acquired during training, which restricts their practical utility. In this
work, we introduce a novel method to enhance the adaptability of MLLMs by
integrating external knowledge sources. Our proposed model, Reflective LLaVA
(ReflectiVA), utilizes reflective tokens to dynamically determine the need for
external knowledge and predict the relevance of information retrieved from an
external database. Tokens are trained following a two-stage two-model training
recipe. This ultimately enables the MLLM to manage external knowledge while
preserving fluency and performance on tasks where external knowledge is not
needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for
knowledge-based visual question answering, highlighting its superior
performance compared to existing methods. Source code and trained models are
publicly available at https://aimagelab.github.io/ReflectiVA.
comment: CVPR 2025
♻ ☆ NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer ICLR 2025
By harnessing the potent generative capabilities of pre-trained large video
diffusion models, we propose NVS-Solver, a new novel view synthesis (NVS)
paradigm that operates without the need for training. NVS-Solver
adaptively modulates the diffusion sampling process with the given views to
enable the creation of remarkable visual experiences from single or multiple
views of static scenes or monocular videos of dynamic scenes. Specifically,
built upon our theoretical modeling, we iteratively modulate the score function
with the given scene priors represented with warped input views to control the
video diffusion process. Moreover, by theoretically exploring the boundary of
the estimation error, we achieve the modulation in an adaptive fashion
according to the view pose and the number of diffusion steps. Extensive
evaluations on both static and dynamic scenes substantiate the significant
superiority of our NVS-Solver over state-of-the-art methods both quantitatively
and qualitatively. Source code:
https://github.com/ZHU-Zhiyu/NVS_Solver.
comment: ICLR 2025
♻ ☆ Modeling Visual Memorability Assessment with Autoencoders Reveals Characteristics of Memorable Images
Image memorability refers to the phenomenon where certain images are more
likely to be remembered than others. It is a quantifiable and intrinsic image
attribute, defined as the likelihood of an image being remembered upon a single
exposure. Despite advances in understanding human visual perception and memory,
it is unclear what features contribute to an image's memorability. To address
this question, we propose a deep learning-based computational modeling
approach. We employ an autoencoder-based approach built on VGG16 convolutional
neural networks (CNNs) to learn latent representations of images. The model is
trained in a single-epoch setting, mirroring human memory experiments that
assess recall after a single exposure. We examine the relationship between
autoencoder reconstruction error and memorability, analyze the distinctiveness
of latent space representations, and develop a multi-layer perceptron (MLP)
model for memorability prediction. Additionally, we perform interpretability
analysis using Integrated Gradients (IG) to identify the key visual
characteristics that contribute to memorability. Our results demonstrate a
significant correlation between the images' memorability score and the
autoencoder's reconstruction error, as well as the robust predictive
performance of its latent representations. Distinctiveness in these
representations correlated significantly with memorability. Additionally,
certain visual characteristics were identified as features contributing to
image memorability in our model. These findings suggest that autoencoder-based
representations capture fundamental aspects of image memorability, providing
new insights into the computational modeling of human visual memory.
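The reported correlations between reconstruction error, distinctiveness, and memorability scores are rank statistics; a numpy-only Spearman coefficient can check such a relationship. The arrays below are hypothetical stand-ins for per-image scores, and the simple ranking assumes no tied values:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (Double argsort converts values to ranks; assumes no ties.)"""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Any strictly monotone relationship yields a perfect rank correlation.
errors = np.array([0.1, 0.4, 0.2, 0.9, 0.6])       # toy reconstruction errors
memorability = errors ** 2                          # toy monotone link
assert abs(spearman(errors, memorability) - 1.0) < 1e-12
```

A rank statistic is the natural choice here because memorability scores and autoencoder losses live on different, nonlinear scales.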
♻ ☆ Adversarial Example Soups: Improving Transferability and Stealthiness for Free
Transferable adversarial examples cause practical security risks since they
can mislead a target model without knowing its internal knowledge. A
conventional recipe for maximizing transferability is to keep only the optimal
adversarial example from all those obtained in the optimization pipeline. In
this paper, for the first time, we revisit this convention and demonstrate that
those discarded, sub-optimal adversarial examples can be reused to boost
transferability. Specifically, we propose "Adversarial Example Soups" (AES),
with AES-tune for averaging discarded adversarial examples in hyperparameter
tuning and AES-rand for stability testing. In addition, our AES is inspired by
"model soups", which averages weights of multiple fine-tuned models for
improved accuracy without increasing inference time. Extensive experiments
validate the global effectiveness of our AES, boosting 10 state-of-the-art
transfer attacks and their combinations by up to 13% against 10 diverse
(defensive) target models. We also show the possibility of generalizing AES to
other types, e.g., directly averaging multiple in-the-wild adversarial
examples that yield comparable success. A promising byproduct of AES is the
improved stealthiness of adversarial examples since the perturbation variances
are naturally reduced.
comment: Accepted by TIFS 2025
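The core averaging step can be sketched simply: mean the discarded adversarial examples and re-project the resulting perturbation into the attack's L-infinity budget. Averaging independent perturbations also shrinks their variance, which is the stealthiness byproduct the abstract notes. A minimal numpy sketch with hypothetical shapes, not the authors' code:

```python
import numpy as np

def adversarial_soup(x, adv_examples, eps):
    """Average several adversarial examples of the same clean input x,
    then re-project into the L-inf ball of radius eps around x."""
    delta = np.mean(np.stack(adv_examples), axis=0) - x
    return x + np.clip(delta, -eps, eps)

rng = np.random.default_rng(0)
x = rng.random((8, 8))
eps = 8 / 255
advs = [x + rng.uniform(-eps, eps, x.shape) for _ in range(5)]
soup = adversarial_soup(x, advs, eps)
assert np.all(np.abs(soup - x) <= eps + 1e-12)  # stays within the budget
# Averaging shrinks the perturbation variance (improved stealthiness).
assert (soup - x).var() < np.mean([(a - x).var() for a in advs])
```

In AES-tune the averaged examples come from a hyperparameter sweep; in AES-rand they come from repeated runs, but the combination step is the same.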
♻ ☆ Repurposing SAM for User-Defined Semantics Aware Segmentation
The Segment Anything Model (SAM) excels at generating precise object masks
from input prompts but lacks semantic awareness, failing to associate its
generated masks with specific object categories. To address this limitation, we
propose U-SAM, a novel framework that instills semantic awareness into SAM,
enabling it to generate targeted masks for user-specified object categories.
Given only object class names as input from the user, U-SAM provides
pixel-level semantic annotations for images without requiring any
labeled/unlabeled samples from the test data distribution. Our approach
leverages synthetically generated or web-crawled images to accumulate semantic
information about the desired object classes. We then learn a mapping function
between SAM's mask embeddings and object class labels, effectively enhancing
SAM with granularity-specific semantic recognition capabilities. As a result,
users can obtain meaningful and targeted segmentation masks for specific
objects they request, rather than generic and unlabeled masks. We evaluate
U-SAM on PASCAL VOC 2012 and MSCOCO-80, achieving significant mIoU improvements
of +17.95% and +5.20%, respectively, over state-of-the-art methods. By
transforming SAM into a semantically aware segmentation model, U-SAM offers a
practical and flexible solution for pixel-level annotation across diverse and
unseen domains in a resource-constrained environment.
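The "mapping function between SAM's mask embeddings and object class labels" could be as simple as a softmax probe trained on frozen embeddings. A self-contained numpy sketch under that assumption, on synthetic embeddings; U-SAM's actual mapping may be richer:

```python
import numpy as np

def train_softmax_probe(E, labels, n_classes, lr=0.1, epochs=300, seed=0):
    """Fit a linear softmax classifier on frozen embeddings E of shape (n, d)
    by full-batch gradient descent on the cross-entropy loss."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((E.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                # one-hot targets
    for _ in range(epochs):
        logits = E @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / len(labels)                # d(loss)/d(logits)
        W -= lr * E.T @ G
        b -= lr * G.sum(axis=0)
    return W, b

# Two synthetic, well-separated "classes" standing in for mask embeddings.
rng = np.random.default_rng(1)
E = np.vstack([rng.normal(+2.0, 0.3, (30, 8)), rng.normal(-2.0, 0.3, (30, 8))])
labels = np.array([0] * 30 + [1] * 30)
W, b = train_softmax_probe(E, labels, n_classes=2)
assert np.mean((E @ W + b).argmax(axis=1) == labels) > 0.95
```

Because the embeddings stay frozen, only the small probe is trained, which fits the paper's setting of needing no labeled samples from the test distribution.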
♻ ☆ Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
Error detection in procedural activities is essential for consistent and
correct outcomes in AR-assisted and robotic systems. Existing methods often
focus on temporal ordering errors or rely on static prototypes to represent
normal actions. However, these approaches typically overlook the common
scenario where multiple, distinct actions are valid following a given sequence
of executed actions. This leads to two issues: (1) the model cannot effectively
detect errors using static prototypes when the inference environment or action
execution distribution differs from training; and (2) the model may also use
the wrong prototypes to detect errors if the ongoing action label is not the
same as the predicted one. To address this problem, we propose an Adaptive
Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all
valid next actions and reconstructs their corresponding normal action
representations, which are compared against the ongoing action to detect
errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art
performance, highlighting the effectiveness of AMNAR and the importance of
modeling multiple valid next actions in error detection. The code is available
at https://github.com/iSEE-Laboratory/AMNAR.
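The comparison step AMNAR describes, checking the ongoing action against representations of all valid next actions rather than a single prototype, can be sketched in miniature (a hypothetical toy, not the AMNAR code; features, distances, and the threshold are invented):

```python
import math

def l2(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def detect_error(ongoing_feat, valid_next_reprs, threshold):
    """Flag an error only when the ongoing action's feature matches *none*
    of the reconstructed representations of the valid next actions."""
    dmin = min(l2(ongoing_feat, r) for r in valid_next_reprs)
    return dmin > threshold

# Two distinct actions are both valid after the executed sequence,
# e.g. "pour water" or "add coffee" in a coffee-making task.
valid = [[1.0, 0.0], [0.0, 1.0]]
ok = detect_error([0.9, 0.1], valid, threshold=0.5)     # close to one valid action
bad = detect_error([-1.0, -1.0], valid, threshold=0.5)  # far from every valid action
```

The point of the multi-representation design is visible even in this toy: with a single static prototype, the second valid action would be misflagged as an error.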
♻ ☆ Distilling Multi-view Diffusion Models into 3D Generators
We introduce DD3G, a formulation that Distills a multi-view Diffusion model
(MV-DM) into a 3D Generator using Gaussian splatting. DD3G compresses and
integrates extensive visual and spatial geometric knowledge from the MV-DM by
simulating its ordinary differential equation (ODE) trajectory, ensuring the
distilled generator generalizes better than those trained solely on 3D data.
Unlike previous amortized optimization approaches, we align the MV-DM and 3D
generator representation spaces to transfer the teacher's probabilistic flow to
the student, thus avoiding inconsistencies in optimization objectives caused by
probabilistic sampling. The introduction of probabilistic flow and the coupling
of various attributes in 3D Gaussians introduce challenges in the generation
process. To tackle this, we propose PEPD, a generator consisting of Pattern
Extraction and Progressive Decoding phases, which enables efficient fusion of
probabilistic flow and converts a single image into 3D Gaussians within 0.06
seconds. Furthermore, to reduce knowledge loss and compensate for sparse-view
supervision, we design a joint optimization objective that ensures the quality
of generated samples through explicit supervision and implicit verification.
Leveraging existing 2D generation models, we compile 120k high-quality RGBA
images for distillation. Experiments on synthetic and public datasets
demonstrate the effectiveness of our method. Our project is available at:
https://qinbaigao.github.io/DD3G_project/
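The distillation idea, training a one-step student to reproduce the endpoint of the teacher's ODE trajectory, can be illustrated with a deliberately tiny stand-in (this is not DD3G or any diffusion model; the ODE dx/dt = -x and both "models" below are invented for illustration):

```python
import math

def teacher_trajectory(x0, t_end=1.0, steps=1000):
    """Many-step Euler integration of a toy ODE dx/dt = -x, standing in
    for the multi-view diffusion model's simulated ODE trajectory."""
    x, dt = x0, t_end / steps
    for _ in range(steps):
        x = x + dt * (-x)
    return x

def student_one_step(x0, t_end=1.0):
    """Distilled 'generator': jumps to the trajectory endpoint in one step.
    Here the closed form exp(-t) is known; in practice this map is learned."""
    return x0 * math.exp(-t_end)

x0 = 2.0
# Distillation loss: match the student's single step to the teacher's
# full trajectory endpoint.
loss = (student_one_step(x0) - teacher_trajectory(x0)) ** 2
```

A small loss here simply means the one-step map agrees with the integrated trajectory, which is the relationship the distillation objective enforces.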
♻ ☆ STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models CVPR-2025
The rapid proliferation of large-scale text-to-image diffusion (T2ID) models
has raised serious concerns about their potential misuse in generating harmful
content. Although numerous methods have been proposed for erasing undesired
concepts from T2ID models, they often provide a false sense of security;
concept-erased models (CEMs) can still be manipulated via adversarial attacks
to regenerate the erased concept. While a few robust concept erasure methods
based on adversarial training have emerged recently, they compromise on utility
(generation quality for benign concepts) to achieve robustness and/or remain
vulnerable to advanced embedding space attacks. These limitations stem from the
failure of robust CEMs to thoroughly search for "blind spots" in the embedding
space. To bridge this gap, we propose STEREO, a novel two-stage framework that
employs adversarial training as a first step rather than the only step for
robust concept erasure. In the first stage, STEREO employs adversarial training
as a vulnerability identification mechanism, searching the embedding space
thoroughly for such blind spots. In the second, robustly-erase-once stage,
STEREO introduces an anchor-concept-based compositional objective to robustly
erase the target concept in a single fine-tuning step, while minimizing the
degradation of model utility. We
benchmark STEREO against seven state-of-the-art concept erasure methods,
demonstrating its superior robustness to both white-box and black-box attacks,
while largely preserving utility.
comment: Accepted to CVPR-2025. Code:
https://github.com/koushiksrivats/robust-concept-erasing
♻ ☆ VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
Recent image-to-video generation methods have demonstrated success in
enabling control over one or two visual elements, such as camera motion or
object motion. However, these methods are unable to offer control over multiple
visual elements due to limitations in data and network efficacy. In this paper,
we introduce VidCRAFT3, a novel framework for precise image-to-video generation
that enables control over camera motion, object motion, and lighting direction
simultaneously. VidCRAFT3 integrates three core components: Image2Cloud
generates a 3D point cloud from a reference image; ObjMotionNet encodes sparse
object trajectories using multi-scale optical flow features; and Spatial
Triple-Attention Transformer incorporates lighting direction embeddings via
parallel cross-attention modules. Additionally, we introduce the
VideoLightingDirection dataset, providing synthetic yet realistic video clips
with accurate per-frame lighting direction annotations, effectively mitigating
the lack of annotated real-world datasets. We further adopt a three-stage
training strategy, ensuring robust learning even without joint multi-element
annotations. Extensive experiments show that VidCRAFT3 produces high-quality
video content, outperforming state-of-the-art methods in control granularity
and visual coherence. Code and data will be publicly available.
♻ ☆ A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition ICME2025
In the domain of multimodal intent recognition (MIR), the objective is to
recognize human intent by integrating a variety of modalities, such as language
text, body gestures, and tones. However, existing approaches struggle to
adequately capture the intrinsic connections between the modalities and
overlook the corresponding semantic representations of intent. To address
these limitations, we present the Anchor-based Multimodal Embedding with
Semantic Synchronization (A-MESS) framework. We first design an Anchor-based
Multimodal Embedding (A-ME) module that employs an anchor-based embedding
fusion mechanism to integrate multimodal inputs. Furthermore, we develop a
Semantic Synchronization (SS) strategy with the Triplet Contrastive Learning
pipeline, which optimizes the process by synchronizing the multimodal
representation with label descriptions produced by a large language model.
Comprehensive experiments indicate that A-MESS achieves state-of-the-art
performance and provides substantial insight into multimodal representation
and downstream tasks.
comment: Accepted by ICME2025
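A-MESS's Triplet Contrastive Learning pipeline synchronizes the fused multimodal embedding with LLM-produced label descriptions; the standard triplet objective it builds on looks like this (a generic textbook sketch, not the paper's loss; the 2-D vectors and margin value are invented):

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet objective: pull the fused multimodal embedding
    (anchor) toward the embedding of its matching intent description
    (positive) and push it away from a mismatched one (negative)."""
    def d(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

fused = [0.8, 0.2]        # anchor: fused multimodal embedding
desc_match = [0.9, 0.1]   # LLM description of the true intent
desc_other = [0.1, 0.9]   # LLM description of a different intent
loss = triplet_loss(fused, desc_match, desc_other)
```

When the anchor already sits closer to the matching description than to the mismatched one by more than the margin, the loss is zero and no gradient is applied, which is what makes the margin formulation well suited to synchronization objectives.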
♻ ☆ Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset CVPR 2025
Yiqun Mei, Mingming He, Li Ma, Julien Philip, Wenqi Xian, David M George, Xueming Yu, Gabriel Dedic, Ahmet Levent Taşel, Ning Yu, Vishal M. Patel, Paul Debevec
Video portrait relighting remains challenging because the results need to be
both photorealistic and temporally stable. This typically requires a strong
model design that can capture complex facial reflections as well as intensive
training on a high-quality paired video dataset, such as dynamic
one-light-at-a-time (OLAT). In this work, we introduce Lux Post Facto, a novel
portrait video relighting method that produces both photorealistic and
temporally consistent lighting effects. From the model side, we design a new
conditional video diffusion model built upon a state-of-the-art pre-trained
video diffusion model, alongside a new lighting injection mechanism to enable
precise control. In this way, we leverage its strong spatial and temporal
generative capability
to generate plausible solutions to the ill-posed relighting problem. Our
technique uses a hybrid dataset consisting of static expression OLAT data and
in-the-wild portrait performance videos to jointly learn relighting and
temporal modeling. This avoids the need to acquire paired video data in
different lighting conditions. Our extensive experiments show that our model
produces state-of-the-art results both in terms of photorealism and temporal
consistency.
comment: CVPR 2025