Computer Vision and Pattern Recognition 157
☆ Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
While Multimodal Large Language Models (MLLMs) excel at holistic
understanding, they struggle to capture dense, complex scenes that require
fine-grained analysis of intricate details and object
inter-relationships. Region-level MLLMs have been a promising step. However,
previous attempts are generally optimized to understand given regions in
isolation, neglecting crucial global contexts. To address this, we introduce
Grasp Any Region (GAR) for comprehensive region-level visual understanding.
Empowered by an effective RoI-aligned feature replay technique, GAR supports
(1) precise perception by leveraging necessary global contexts, and (2)
modeling interactions between multiple prompts. Together, it then naturally
achieves (3) advanced compositional reasoning to answer specific free-form
questions about any region, shifting the paradigm from passive description to
active dialogue. Moreover, we construct GAR-Bench, which not only provides a
more accurate evaluation of single-region comprehension, but also, more
importantly, measures interactions and complex reasoning across multiple
regions. Extensive experiments have demonstrated that GAR-1B not only maintains
state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5
on DLC-Bench, but also excels at modeling relationships between multiple
prompts with advanced comprehension capabilities, even surpassing InternVL3-78B
on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms
in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong
capabilities can be easily transferred to videos.
☆ DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Reasoning about dynamic spatial relationships is essential, as both observers
and objects often move simultaneously. Although vision-language models (VLMs)
and visual expert models excel in 2D tasks and static scenarios, their
ability to fully understand dynamic 3D scenarios remains limited. We introduce
Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly
1,000 dynamic videos and over 1,700 manually annotated questions covering nine
decoupled motion patterns of observers and objects. Spatially and temporally
symmetric designs reduce biases and enable systematic evaluation of models'
reasoning about self-motion and object motion. Our evaluation of 14 VLMs and
expert models reveals key limitations: models often conflate observer and
object motion, exhibit semantic biases, and fail to accurately infer relative
relationships in dynamic scenarios. Our DSI-Bench provides valuable findings
and insights into the future development of general and expert models with
dynamic spatial intelligence.
☆ LightMem: Lightweight and Efficient Memory-Augmented Generation
Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang
Despite their remarkable capabilities, Large Language Models (LLMs) struggle
to effectively leverage historical interaction information in dynamic and
complex environments. Memory systems enable LLMs to move beyond stateless
interactions by introducing persistent information storage, retrieval, and
utilization mechanisms. However, existing memory systems often introduce
substantial time and computational overhead. To this end, we introduce a new
memory system called LightMem, which strikes a balance between the performance
and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of
human memory, LightMem organizes memory into three complementary stages. First,
cognition-inspired sensory memory rapidly filters irrelevant information
through lightweight compression and groups the remaining information by
topic. Next, topic-aware short-term memory consolidates these topic-based
groups, organizing and summarizing content for more structured access. Finally,
long-term memory with sleep-time update employs an offline procedure that
decouples consolidation from online inference. Experiments on LongMemEval with
GPT and Qwen backbones show that LightMem outperforms strong baselines in
accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API
calls by up to 159x, and runtime by over 12x. The code is available at
https://github.com/zjunlp/LightMem.
comment: Work in progress
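A minimal sketch of LightMem's three-stage organization as described in the abstract; all class names, method names, and heuristics below are invented for illustration (the actual API lives in the linked repository): a sensory stage that cheaply filters and topic-groups turns, a short-term stage that summarizes each topic group, and a long-term store whose consolidation ("sleep-time update") runs offline, decoupled from inference.

```python
# Illustrative sketch only; names and heuristics are invented, not the LightMem API.
from collections import defaultdict

def sensory_filter(turns, min_len=20):
    """Stage 1: cheap filtering + topic grouping of raw interaction turns."""
    kept = [t for t in turns if len(t["text"]) >= min_len]      # lightweight "compression"
    groups = defaultdict(list)
    for t in kept:
        groups[t["topic"]].append(t["text"])                    # group by (pre-tagged) topic
    return groups

def short_term_consolidate(groups, summarize):
    """Stage 2: topic-aware summaries for structured access."""
    return {topic: summarize(texts) for topic, texts in groups.items()}

class LongTermStore:
    """Stage 3: persistent store; consolidation runs offline ("sleep-time update")."""
    def __init__(self):
        self.memory = {}
        self.pending = []

    def stage(self, summaries):          # cheap online write
        self.pending.append(summaries)

    def sleep_time_update(self, merge):  # offline consolidation, decoupled from inference
        for summaries in self.pending:
            for topic, summary in summaries.items():
                self.memory[topic] = merge(self.memory.get(topic), summary)
        self.pending.clear()

# Usage with trivial stand-ins for the LLM-backed summarize/merge calls:
turns = [{"topic": "travel", "text": "I booked a flight to Tokyo for the 12th of May."},
         {"topic": "travel", "text": "ok"},
         {"topic": "diet", "text": "Please remember that I am allergic to peanuts."}]
store = LongTermStore()
store.stage(short_term_consolidate(sensory_filter(turns), summarize=" ".join))
store.sleep_time_update(merge=lambda old, new: new if old is None else old + " | " + new)
print(store.memory)
```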
★ DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution NeurIPS 2025
Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world
image super-resolution (Real-ISR) methods can synthesize rich and realistic
details. However, due to the inherent stochasticity of T2I models, different
noise inputs often lead to outputs with varying perceptual quality. Although
this randomness is sometimes seen as a limitation, it also introduces a wider
perceptual quality range, which can be exploited to improve Real-ISR
performance. To this end, we introduce Direct Perceptual Preference
Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative
models with perceptual preferences without requiring costly human annotations.
We construct a hybrid reward signal by combining full-reference and
no-reference image quality assessment (IQA) models trained on large-scale human
preference datasets. This reward encourages both structural fidelity and
natural appearance. To better utilize perceptual diversity, we move beyond the
standard best-vs-worst selection and construct multiple preference pairs from
outputs of the same model. Our analysis reveals that the optimal selection
ratio depends on model capacity: smaller models benefit from broader coverage,
while larger models respond better to stronger contrast in supervision.
Furthermore, we propose hierarchical preference optimization, which adaptively
weights training pairs based on intra-group reward gaps and inter-group
diversity, enabling more efficient and stable learning. Extensive experiments
across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR
significantly improves perceptual quality and generalizes well to real-world
benchmarks.
comment: Accepted by NeurIPS 2025
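A rough sketch of the two ingredients the DP$^2$O-SR abstract describes, namely a hybrid full-reference plus no-reference reward and preference pairs drawn from several outputs of the same model rather than only best-vs-worst. The reward weights, selection ratios, and pairing rule are placeholders, not the paper's exact recipe.

```python
# Illustrative sketch; reward models, weights, and the pairing rule are placeholders.
import itertools

def hybrid_reward(fr_score, nr_score, w_fr=0.5, w_nr=0.5):
    """Combine a full-reference IQA score (fidelity) with a no-reference one (naturalness)."""
    return w_fr * fr_score + w_nr * nr_score

def build_preference_pairs(candidates, top_ratio=0.25, bottom_ratio=0.25):
    """From several outputs of the same model for one input, pair high-reward against
    low-reward samples instead of only the single best vs. the single worst."""
    ranked = sorted(candidates, key=lambda c: c["reward"], reverse=True)
    k_top = max(1, int(len(ranked) * top_ratio))
    k_bot = max(1, int(len(ranked) * bottom_ratio))
    winners, losers = ranked[:k_top], ranked[-k_bot:]
    return [(w, l) for w, l in itertools.product(winners, losers) if w["reward"] > l["reward"]]

# Usage: score 4 stochastic Real-ISR outputs and form DPO-style training pairs.
candidates = [{"id": i, "reward": hybrid_reward(fr, nr)}
              for i, (fr, nr) in enumerate([(0.81, 0.70), (0.78, 0.75), (0.65, 0.60), (0.70, 0.52)])]
pairs = build_preference_pairs(candidates)
print([(w["id"], l["id"]) for w, l in pairs])
```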
☆ See the Text: From Tokenization to Visual Reading
People see text. Humans read by recognizing words as visual objects,
including their shapes, layouts, and patterns, before connecting them to
meaning, which enables us to handle typos, distorted fonts, and various scripts
effectively. Modern large language models (LLMs), however, rely on subword
tokenization, fragmenting text into pieces from a fixed vocabulary. While
effective for high-resource languages, this approach over-segments low-resource
languages, yielding long, linguistically meaningless sequences and inflating
computation. In this work, we challenge this entrenched paradigm and move
toward a vision-centric alternative. Our method, SeeTok, renders text as images
(visual-text) and leverages pretrained multimodal LLMs to interpret them,
reusing strong OCR and text-vision alignment abilities learned from large-scale
multimodal training. Across three different language tasks, SeeTok matches or
surpasses subword tokenizers while requiring 4.43 times fewer tokens and
reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization,
robustness to typographic noise, and linguistic hierarchy. SeeTok signals a
shift from symbolic tokenization to human-like visual reading, and takes a step
toward more natural and cognitively inspired language models.
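A minimal sketch of the rendering step SeeTok relies on, turning raw text into a "visual-text" image that a multimodal LLM could read; the font, canvas size, word-wrap heuristic, and the omitted downstream multimodal-LLM call are arbitrary choices, not SeeTok's actual settings.

```python
# Minimal sketch of rendering text as an image ("visual-text"); settings are illustrative.
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text, width=512, line_height=22, margin=8):
    font = ImageFont.load_default()
    # naive word wrap so long inputs stay on the canvas
    words, lines, line = text.split(), [], ""
    for w in words:
        candidate = (line + " " + w).strip()
        if len(candidate) * 8 > width - 2 * margin:   # rough per-character width estimate
            lines.append(line); line = w
        else:
            line = candidate
    lines.append(line)
    img = Image.new("RGB", (width, margin * 2 + line_height * len(lines)), "white")
    draw = ImageDraw.Draw(img)
    for i, ln in enumerate(lines):
        draw.text((margin, margin + i * line_height), ln, fill="black", font=font)
    return img

img = render_text_as_image("Typos, distorted fonts, and low-resource scripts all survive "
                           "this rendering, since the model reads pixels rather than subwords.")
img.save("visual_text.png")   # this image would be fed to a pretrained multimodal LLM
```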
☆ FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning
Federated learning (FL) enables multiple clients to collaboratively train
machine learning models without exposing local data, balancing performance and
privacy. However, domain shift and label heterogeneity across clients often
hinder the generalization of the aggregated global model. Recently, large-scale
vision-language models like CLIP have shown strong zero-shot classification
capabilities, raising the question of how to effectively fine-tune CLIP across
domains in a federated setting. In this work, we propose an adaptive federated
prompt tuning framework, FedDEAP, to enhance CLIP's generalization in
multi-domain scenarios. Our method includes the following three key components:
(1) To mitigate the loss of domain-specific information caused by
label-supervised tuning, we disentangle semantic and domain-specific features
in images by using semantic and domain transformation networks with unbiased
mappings; (2) To preserve domain-specific knowledge during global prompt
aggregation, we introduce a dual-prompt design with a global semantic prompt
and a local domain prompt to balance shared and personalized information; (3)
To maximize the inclusion of semantic and domain information from images in the
generated text features, we align textual and visual representations under the
two learned transformations to preserve semantic and domain consistency.
Theoretical analysis and extensive experiments on four datasets demonstrate the
effectiveness of our method in enhancing the generalization of CLIP for
federated image recognition across multiple domains.
comment: Accepted at MM 2025
☆ Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework NeurIPS 2025
Graph Transformers (GTs) have emerged as a powerful paradigm for graph
representation learning due to their ability to model diverse node
interactions. However, existing GTs often rely on intricate architectural
designs tailored to specific interactions, limiting their flexibility. To
address this, we propose a unified hierarchical mask framework that reveals an
underlying equivalence between model architecture and attention mask
construction. This framework enables a consistent modeling paradigm by
capturing diverse interactions through carefully designed attention masks.
Theoretical analysis under this framework demonstrates that the probability of
correct classification positively correlates with the receptive field size and
label consistency, leading to a fundamental design principle: an effective
attention mask should ensure both a sufficiently large receptive field and a
high level of label consistency. While no single existing mask satisfies this
principle across all scenarios, our analysis reveals that hierarchical masks
offer complementary strengths, motivating their effective integration. Then, we
introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with
Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates
three theoretically grounded hierarchical masks and employs a bi-level expert
routing mechanism to adaptively integrate multi-level interaction information.
To ensure scalability, we further introduce a dual attention computation scheme
that dynamically switches between dense and sparse modes based on local mask
sparsity. Extensive experiments across multiple benchmarks demonstrate that
M3Dphormer achieves state-of-the-art performance, validating the effectiveness
of our unified framework and model design.
comment: Accepted by NeurIPS 2025 (Poster)
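A small sketch of the dual attention computation idea mentioned in the M3Dphormer abstract: per attention mask, evaluate either a dense masked softmax or a sparse per-query computation depending on how sparse the mask is. The threshold, shapes, and single-head setting are illustrative only; both paths produce the same result.

```python
# Sketch of dense-vs-sparse masked attention; thresholds and shapes are illustrative.
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask, sparsity_threshold=0.9):
    """q, k, v: [n, d]; mask: [n, n] boolean (True = allowed interaction)."""
    sparsity = 1.0 - mask.float().mean()
    scale = q.shape[-1] ** -0.5
    if sparsity < sparsity_threshold:
        # Dense mode: full score matrix with disallowed pairs set to -inf.
        scores = (q @ k.t()) * scale
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v
    # Sparse mode: per-query softmax over only the allowed keys.
    out = torch.zeros_like(q)
    for i in range(q.shape[0]):
        js = mask[i].nonzero(as_tuple=True)[0]
        scores = (q[i] @ k[js].t()) * scale
        out[i] = F.softmax(scores, dim=-1) @ v[js]
    return out

n, d = 6, 8
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
local_mask = torch.eye(n, dtype=torch.bool) | torch.eye(n, dtype=torch.bool).roll(1, 1)
dense = masked_attention(q, k, v, local_mask, sparsity_threshold=0.9)   # dense path
sparse = masked_attention(q, k, v, local_mask, sparsity_threshold=0.5)  # sparse path
print(torch.allclose(dense, sparse, atol=1e-6))                         # both modes agree: True
```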
☆ SAM 2++: Tracking Anything at Any Granularity
Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang
Video tracking aims at finding the specific target in subsequent frames given
its initial state. Due to the varying granularity of target states across
different tasks, most existing trackers are tailored to a single task and
heavily rely on task-specific, custom-designed modules, which
limits their generalization and leads to redundancy in both model design and
parameters. To unify video tracking tasks, we present SAM 2++, a unified model
towards tracking at any granularity, including masks, boxes, and points. First,
to extend target granularity, we design task-specific prompts to encode various
task inputs into general prompt embeddings, and a unified decoder that converts
diverse task results into a unified pre-output form. Next, to support memory
matching, the core operation of tracking, we introduce a task-adaptive memory
mechanism that unifies memory across different granularities. Finally, we
introduce a customized data engine to support tracking training at any
granularity, producing a large and diverse video tracking dataset with rich
annotations at three granularities, termed Tracking-Any-Granularity, which
represents a comprehensive resource for training and benchmarking on unified
tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++
sets a new state of the art across diverse tracking tasks at different
granularities, establishing a unified and robust tracking framework.
comment: 8 pages, and 10 pages in Supplementary Material
☆ An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection
Tuberculosis remains a critical global health issue, particularly in
resource-limited and remote areas. Early detection is vital for treatment, yet
the lack of skilled radiologists underscores the need for artificial
intelligence (AI)-driven screening tools. Developing reliable AI models is
challenging due to the necessity for large, high-quality datasets, which are
costly to obtain. To tackle this, we propose a teacher-student framework that
enhances both disease and symptom detection on chest X-rays by integrating two
supervised heads and a self-supervised head. Our model achieves an accuracy of
98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and
a macro-F1 score of 90.09% for multilabel symptom detection, significantly
outperforming baselines. The explainability assessments also show the model
bases its predictions on relevant anatomical features, demonstrating promise
for deployment in clinical screening and triage settings.
comment: 16 pages, 3 figures
☆ A Geometric Approach to Steerable Convolutions
In contrast to the somewhat abstract, group theoretical approach adopted by
many papers, our work provides a new and more intuitive derivation of steerable
convolutional neural networks in $d$ dimensions. This derivation is based on
geometric arguments and fundamental principles of pattern matching. We offer an
intuitive explanation for the appearance of the Clebsch--Gordan decomposition
and spherical harmonic basis functions. Furthermore, we suggest a novel way to
construct steerable convolution layers using interpolation kernels that improve
upon existing implementations and offer greater robustness to noisy data.
☆ ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Xiaoxing Hu, Kaicheng Yang, Ziyong Feng, Qi Ming, Zonghao Guo, Xiang An, Junchi Yan, Xue Yang
The original CLIP text encoder is limited by a maximum input length of 77
tokens, which hampers its ability to effectively process long texts and perform
fine-grained semantic understanding. In addition, the CLIP text encoder lacks
support for multilingual inputs. All these limitations significantly restrict
its applicability across a broader range of tasks. Recent studies have
attempted to replace the CLIP text encoder with an LLM-based embedder to
enhance its ability in processing long texts, multilingual understanding, and
fine-grained semantic comprehension. However, because the representation spaces
of LLMs and the vision-language space of CLIP are pretrained independently
without alignment priors, direct alignment using contrastive learning can
disrupt the intrinsic vision-language alignment in the CLIP image encoder,
leading to an underutilization of the knowledge acquired during pre-training.
To address this challenge, we propose ProCLIP, a curriculum learning-based
progressive vision-language alignment framework to effectively align the CLIP
image encoder with an LLM-based embedder. Specifically, ProCLIP first distills
knowledge from CLIP's text encoder into the LLM-based embedder to leverage
CLIP's rich pretrained knowledge while establishing initial alignment between
the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns
the CLIP image encoder with the LLM-based embedder through image-text
contrastive tuning, employing self-distillation regularization to avoid
overfitting. To achieve a more effective alignment, instance semantic alignment
loss and embedding structure alignment loss are employed during representation
inheritance and contrastive tuning. The code is available at
https://github.com/VisionXLab/ProCLIP
comment: 17 pages, 5 figures
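A hedged sketch of ProCLIP's first (representation-inheritance) stage only, as the abstract describes it: distill CLIP's pretrained text embeddings into an LLM-based embedder before any contrastive tuning. The backbone, projection head, dimensions, and cosine objective below are stand-ins, not the paper's actual components.

```python
# Sketch of stage-1 distillation into an LLM-based embedder; models are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLMEmbedderWithProjection(nn.Module):
    def __init__(self, llm_dim=1024, clip_dim=512):
        super().__init__()
        self.backbone = nn.Linear(4096, llm_dim)      # stand-in for an LLM-based encoder
        self.proj = nn.Linear(llm_dim, clip_dim)      # maps into CLIP's embedding space

    def forward(self, llm_hidden):                    # llm_hidden: [batch, 4096]
        return self.proj(self.backbone(llm_hidden))

def distillation_loss(student_emb, clip_text_emb):
    """Pull the LLM embedder toward CLIP's pretrained text space (initial alignment)."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(clip_text_emb, dim=-1)
    return (1.0 - (s * t).sum(-1)).mean()

student = LLMEmbedderWithProjection()
llm_hidden = torch.randn(8, 4096)       # hidden states from the LLM for a batch of captions
clip_text_emb = torch.randn(8, 512)     # embeddings of the same captions from CLIP's text encoder
loss = distillation_loss(student(llm_hidden), clip_text_emb)
loss.backward()
```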
☆ Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
A recent class of hyperspectral anomaly detection methods that can be trained
once on background datasets and then universally deployed -- without per-scene
retraining or parameter tuning -- has demonstrated remarkable efficiency and
robustness. Building upon this paradigm, we focus on the integration of
spectral and spatial cues and introduce a novel "Rebellious Student" framework
for complementary feature learning. Unlike conventional teacher-student
paradigms driven by imitation, our method intentionally trains the spatial
branch to diverge from the spectral teacher, thereby learning complementary
spatial patterns that the teacher fails to capture. A two-stage learning
strategy is adopted: (1) a spectral enhancement network is first trained via
reverse distillation to obtain robust background spectral representations; and
(2) a spatial network -- the rebellious student -- is subsequently optimized
using decorrelation losses that enforce feature orthogonality while maintaining
reconstruction fidelity to avoid irrelevant noise. Once trained, the framework
enhances both spectral and spatial background features, enabling parameter-free
and training-free anomaly detection when paired with conventional detectors.
Extensive experiments on the HAD100 benchmark show substantial improvements
over several established baselines with minimal computational overhead,
confirming the effectiveness and generality of the proposed complementary
learning paradigm. Our code is publicly available at
https://github.com/xjpp2016/FERS.
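A small sketch of the second training stage described in the Rebellious Student abstract: the spatial student is pushed to be decorrelated from the spectral teacher while a reconstruction term keeps it anchored to the data. The networks, normalization, and loss weight are placeholders rather than the paper's exact configuration.

```python
# Sketch of a decorrelation + reconstruction objective; weights and shapes are illustrative.
import torch
import torch.nn.functional as F

def decorrelation_loss(student_feat, teacher_feat, eps=1e-6):
    """Penalize cross-correlation between student and teacher feature channels."""
    s = (student_feat - student_feat.mean(0)) / (student_feat.std(0) + eps)
    t = (teacher_feat - teacher_feat.mean(0)) / (teacher_feat.std(0) + eps)
    cross_corr = (s.t() @ t) / s.shape[0]          # [C_student, C_teacher]
    return cross_corr.pow(2).mean()                # ~0 when the features are uncorrelated

def rebellious_student_loss(student_feat, teacher_feat, recon, pixels, lam=0.1):
    fidelity = F.mse_loss(recon, pixels)           # keep the student faithful to the input
    return fidelity + lam * decorrelation_loss(student_feat, teacher_feat)

# Usage on dummy tensors (N pixels/patches, C-dim features, B spectral bands):
N, C, B = 256, 64, 100
student_feat, teacher_feat = torch.randn(N, C, requires_grad=True), torch.randn(N, C)
recon, pixels = torch.randn(N, B, requires_grad=True), torch.randn(N, B)
rebellious_student_loss(student_feat, teacher_feat, recon, pixels).backward()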
☆ UltraGen: High-Resolution Video Generation with Hierarchical Attention
Recent advances in video generation have made it possible to produce visually
compelling videos, with wide-ranging applications in content creation,
entertainment, and virtual reality. However, most existing diffusion
transformer based video generation models are limited to low-resolution outputs
(<=720P) due to the quadratic computational complexity of the attention
mechanism with respect to the output width and height. This computational
bottleneck makes native high-resolution video generation (1080P/2K/4K)
impractical for both training and inference. To address this challenge, we
present UltraGen, a novel video generation framework that enables i) efficient
and ii) end-to-end native high-resolution video synthesis. Specifically,
UltraGen features a hierarchical dual-branch attention architecture based on
global-local attention decomposition, which decouples full attention into a
local attention branch for high-fidelity regional content and a global
attention branch for overall semantic consistency. We further propose a
spatially compressed global modeling strategy to efficiently learn global
dependencies, and a hierarchical cross-window local attention mechanism to
reduce computational costs while enhancing information flow across different
local windows. Extensive experiments demonstrate that UltraGen can effectively
scale pre-trained low-resolution video models to 1080P and even 4K resolution
for the first time, outperforming existing state-of-the-art methods and
super-resolution based two-stage pipelines in both qualitative and quantitative
evaluations.
☆ Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction NeurIPS 2025
Jannis Fleckenstein, David Kreismann, Tamara Rosemary Govindasamy, Thomas Brunschwiler, Etienne Vos, Mattia Rigotti
As urbanization and climate change progress, urban heat island effects are
becoming more frequent and severe. To formulate effective mitigation plans,
cities require detailed air temperature data, yet conventional machine learning
models with limited data often produce inaccurate predictions, particularly in
underserved areas. Geospatial foundation models trained on global unstructured
data offer a promising alternative by demonstrating strong generalization and
requiring only minimal fine-tuning. In this study, an empirical ground truth of
urban heat patterns is established by quantifying cooling effects from green
spaces and benchmarking them against model predictions to evaluate the model's
accuracy. The foundation model is subsequently fine-tuned to predict land
surface temperatures under future climate scenarios, and its practical value is
demonstrated through a simulated inpainting that highlights its role for
mitigation support. The results indicate that foundation models offer a
powerful way for evaluating urban heat island mitigation strategies in
data-scarce regions to support more climate-resilient cities.
comment: 10 pages, 9 figures. Accepted at the NeurIPS 2025 Workshop on
Tackling Climate Change with Machine Learning
☆ Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation
Climate change is intensifying the occurrence of harmful algal bloom (HAB),
particularly cyanobacteria, which threaten aquatic ecosystems and human health
through oxygen depletion, toxin release, and disruption of marine biodiversity.
Traditional monitoring approaches, such as manual water sampling, remain
labor-intensive and limited in spatial and temporal coverage. Recent advances
in vision-language models (VLMs) for remote sensing have shown potential for
scalable AI-driven solutions, yet challenges remain in reasoning over imagery
and quantifying bloom severity. In this work, we introduce ALGae Observation
and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB
monitoring that combines remote sensing image understanding with severity
estimation. Our approach integrates GeoSAM-assisted human evaluation for
high-quality segmentation mask curation and fine-tunes a vision-language model on
severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML)
from NASA. Experiments demonstrate that ALGOS achieves robust performance on
both segmentation and severity-level estimation, paving the way toward
practical and automated cyanobacterial monitoring systems.
☆ SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery NeurIPS 2025
This paper investigates the problem of Generalized Category Discovery (GCD).
Given a partially labelled dataset, GCD aims to categorize all unlabelled
images, regardless of whether they belong to known or unknown classes. Existing
approaches typically depend on either single-level semantics or manually
designed abstract hierarchies, which limit their generalizability and
scalability. To address these limitations, we introduce a SEmantic-aware
hierArchical Learning framework (SEAL), guided by naturally occurring and
easily accessible hierarchical structures. Within SEAL, we propose a
Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits
hierarchical similarity to generate informative soft negatives, addressing the
limitations of conventional contrastive losses that treat all negatives
equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed
to align the predictions from different levels of granularity. SEAL
consistently achieves state-of-the-art performance on fine-grained benchmarks,
including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and
further demonstrates generalization on coarse-grained datasets. Project page:
https://visual-ai.github.io/seal/
comment: Accepted to NeurIPS 2025
☆ Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for real-time
view synthesis in colonoscopy, enabling critical applications such as virtual
colonoscopy and lesion tracking. However, the vanilla 3DGS assumes static
illumination and that observed appearance depends solely on viewing angle,
which causes incompatibility with the photometric variations in colonoscopic
scenes induced by dynamic light source/camera. This mismatch forces most 3DGS
methods to introduce structure-violating vaporous Gaussian blobs between the
camera and tissues to compensate for illumination attenuation, ultimately
degrading the quality of 3D reconstructions. Previous works only consider the
illumination attenuation caused by light distance, ignoring the physical
characteristics of the light source and camera. In this paper, we propose ColIAGS, an
improved 3DGS framework tailored for colonoscopy. To mimic realistic appearance
under varying illumination, we introduce an Improved Appearance Modeling with
two types of illumination attenuation factors, which enables Gaussians to adapt
to photometric variations while preserving geometry accuracy. To ensure the
geometry approximation condition of appearance modeling, we propose an Improved
Geometry Modeling using high-dimensional view embedding to enhance Gaussian
geometry attribute prediction. Furthermore, another cosine embedding input is
leveraged to generate illumination attenuation solutions in an implicit manner.
Comprehensive experimental results on standard benchmarks demonstrate that our
proposed ColIAGS achieves the dual capabilities of novel view synthesis and
accurate geometric reconstruction. It notably outperforms other
state-of-the-art methods by achieving superior rendering fidelity while
significantly reducing Depth MSE. Code will be available.
☆ IF-VidCap: Can Video Caption Models Follow Instructions?
Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu
Although Multimodal Large Language Models (MLLMs) have demonstrated
proficiency in video captioning, practical applications require captions that
follow specific user instructions rather than generating exhaustive,
unconstrained descriptions. Current benchmarks, however, primarily assess
descriptive comprehensiveness while largely overlooking instruction-following
capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for
evaluating controllable video captioning, which contains 1,400 high-quality
samples. Distinct from existing video captioning or general
instruction-following benchmarks, IF-VidCap incorporates a systematic framework
that assesses captions on two dimensions: format correctness and content
correctness. Our comprehensive evaluation of over 20 prominent models reveals a
nuanced landscape: despite the continued dominance of proprietary models, the
performance gap is closing, with top-tier open-source solutions now achieving
near-parity. Furthermore, we find that models specialized for dense captioning
underperform general-purpose MLLMs on complex instructions, indicating that
future work should simultaneously advance both descriptive richness and
instruction-following fidelity.
comment: https://github.com/NJU-LINK/IF-VidCap
☆ SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
Autoregressive image generation models like Janus-Pro produce high-quality
images, but at the significant cost of high memory and ever-growing
computational demands due to the large number of visual tokens. While KV cache
compression has been extensively studied in language modeling, it remains
largely unexplored for the image generation domain. In this work, we begin by
identifying a distinct and prominent attention phenomenon, which we term
spatial locality and emergent semantic sink. To leverage this key insight, we
introduce a novel KV cache compression framework. Specifically, we compress the
KV cache for all visual tokens by adaptively decoupling attention heads into
two separate types: for spatial-locality heads, our method maintains a short
recent token window; for semantic-sink heads, it strategically preserves a
compact set of highly-attended tokens. Our extensive experiments demonstrate
that the proposed method achieves a 5$\times$ reduction in memory usage and a
notable 6.6$\times$ speedup in overall throughput with only minimal visual
quality loss, thereby enabling highly efficient native autoregressive image
generation on resource-constrained hardware.
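A compact sketch of the head-decoupled KV-cache policy the SSD abstract describes: spatial-locality heads keep only a recent window of cached tokens, while semantic-sink heads keep the most-attended tokens. The head classification, budgets, and accumulated-attention statistic below are assumed stand-ins, not the paper's measured values.

```python
# Sketch of per-head-type KV cache compression; budgets and head labels are illustrative.
import torch

def compress_kv(keys, values, attn_scores, head_is_spatial, window=64, sink_budget=64):
    """keys/values: [heads, seq, dim]; attn_scores: [heads, seq] accumulated attention
    each cached token has received; head_is_spatial: [heads] bool."""
    new_k, new_v = [], []
    for h in range(keys.shape[0]):
        if head_is_spatial[h]:
            keep = torch.arange(max(0, keys.shape[1] - window), keys.shape[1])   # recent window
        else:
            keep = attn_scores[h].topk(min(sink_budget, keys.shape[1])).indices.sort().values
        new_k.append(keys[h, keep])
        new_v.append(values[h, keep])
    return new_k, new_v   # ragged per-head caches

heads, seq, dim = 4, 1024, 64
keys, values = torch.randn(heads, seq, dim), torch.randn(heads, seq, dim)
attn_scores = torch.rand(heads, seq)
head_is_spatial = torch.tensor([True, True, False, False])
k_small, _ = compress_kv(keys, values, attn_scores, head_is_spatial)
print([k.shape[0] for k in k_small])   # kept tokens per head, e.g. [64, 64, 64, 64]
```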
☆ PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting NeurIPS 2025
Changkun Liu, Bin Tan, Zeran Ke, Shangzhan Zhang, Jiachen Liu, Ming Qian, Nan Xue, Yujun Shen, Tristan Braud
This paper addresses metric 3D reconstruction of indoor scenes by exploiting
their inherent geometric regularities with compact representations. Using
planar 3D primitives - a well-suited representation for man-made environments -
we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction
from unposed two-view images. Our approach employs Vision Transformers to
extract a set of sparse planar primitives, estimate relative camera poses, and
supervise geometry learning via planar splatting, where gradients are
propagated through high-resolution rendered depth and normal maps of
primitives. Unlike prior feedforward methods that require 3D plane annotations
during training, PLANA3R learns planar 3D structures without explicit plane
supervision, enabling scalable training on large-scale stereo datasets using
only depth and normal annotations. We validate PLANA3R on multiple indoor-scene
datasets with metric supervision and demonstrate strong generalization to
out-of-domain indoor environments across diverse tasks under metric evaluation
protocols, including 3D surface reconstruction, depth estimation, and relative
pose estimation. Furthermore, by formulating the problem with a planar 3D
representation, our method naturally gains the ability to perform accurate
plane segmentation. The project
page is available at https://lck666666.github.io/plana3r
comment: 39th Conference on Neural Information Processing Systems (NeurIPS
2025). The project page is available at: https://lck666666.github.io/plana3r
☆ A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Recently, action recognition has been dominated by transformer-based methods,
thanks to their spatiotemporal contextual aggregation capacities. However,
despite the significant progress achieved on scene-related datasets, they do
not perform well on motion-sensitive datasets due to the lack of elaborate
motion modeling designs. Meanwhile, we observe that the widely-used cost volume
in traditional action recognition is highly similar to the affinity matrix
defined in self-attention, but equipped with powerful motion modeling
capacities. In light of this, we propose to integrate those effective motion
modeling properties into the existing transformer in a unified and neat way,
with the proposal of the Explicit Motion Information Mining module (EMIM). In
EMIM, we propose to construct the desirable affinity matrix in a cost volume
style, where the set of key candidate tokens is sampled from the query-based
neighboring area in the next frame in a sliding-window manner. Then, the
constructed affinity matrix is used to aggregate contextual information for
appearance modeling and is converted into motion features for motion modeling
as well. We validate the motion modeling capacities of our method on four
widely-used datasets, and our method performs better than existing
state-of-the-art approaches, especially on motion-sensitive datasets, i.e.,
Something-Something V1 & V2.
comment: accepted by Pattern Recognition. We have always been curious to see
whether our designs could be beneficial in other scenarios, such as embedding
it into the DiT model or 3D-VAE for video generation. If you are interested
in it, why not give it a shot?
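A sketch of the cost-volume-style affinity sketched in the EMIM abstract: for every query position in frame t, correlate it with a sliding window of keys at the corresponding location in frame t+1. Channel counts, window size, and the softmax aggregation into "motion features" are illustrative assumptions, not EMIM's exact design.

```python
# Sketch of a local (sliding-window) affinity between consecutive frames; shapes are illustrative.
import torch
import torch.nn.functional as F

def local_affinity(query_t, key_t1, window=7):
    """query_t, key_t1: [B, C, H, W]. Returns [B, window*window, H, W] affinities,
    i.e. one correlation value per neighboring key in the next frame."""
    B, C, H, W = query_t.shape
    pad = window // 2
    # gather, for every spatial position, its window x window neighborhood in frame t+1
    key_patches = F.unfold(key_t1, kernel_size=window, padding=pad)        # [B, C*w*w, H*W]
    key_patches = key_patches.view(B, C, window * window, H * W)
    q = query_t.view(B, C, 1, H * W)
    affinity = (q * key_patches).sum(1) / C ** 0.5                         # [B, w*w, H*W]
    return affinity.view(B, window * window, H, W)

q, k = torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14)
aff = local_affinity(q, k)
motion = F.softmax(aff.flatten(2), dim=1)   # could be turned into motion features downstream
print(aff.shape)                            # torch.Size([2, 49, 14, 14])
```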
☆ Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Contrastive vision-language models such as CLIP have demonstrated strong
performance across a wide range of multimodal tasks by learning from aligned
image-text pairs. However, their ability to handle complex, real-world web
documents remains limited, particularly in scenarios where text and images are
interleaved, loosely aligned, or embedded in visual form. To address these
challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified
framework that models text, images, and their combinations using a single
vision transformer. VC2L operates entirely in pixel space by rendering all
inputs, whether textual, visual, or combined, as images, thus eliminating the
need for OCR, text tokenization, or modality fusion strategies. To capture
complex cross-modal relationships in multimodal web documents, VC2L employs a
snippet-level contrastive learning objective that aligns consecutive multimodal
segments, leveraging the inherent coherence of documents without requiring
explicitly paired image-text data. To assess the effectiveness of this
approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR,
designed to evaluate cross-modal retrieval, fine-grained sequential
understanding, and generalization to unseen data, respectively. Empirical
results show that VC2L achieves competitive or superior performance compared to
CLIP-style models on both the proposed benchmarks and established datasets such
as M-BEIR and MTEB. These findings underscore the potential of multimodal web
data as a valuable training resource for contrastive learning and illustrate
the scalability of a unified, vision-centric approach for multimodal
representation learning. Code and models are available at:
https://github.com/showlab/VC2L.
comment: Project page: https://linyq17.github.io/VC2L/
☆ UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Recent progress in text-to-image (T2I) generation underscores the importance
of reliable benchmarks in evaluating how accurately generated images reflect
the semantics of their textual prompt. However, (1) existing benchmarks lack
the diversity of prompt scenarios and multilingual support, both essential for
real-world applicability; (2) they offer only coarse evaluations across primary
dimensions, covering a narrow range of sub-dimensions, and fall short in
fine-grained sub-dimension assessment. To address these limitations, we
introduce UniGenBench++, a unified semantic assessment benchmark for T2I
generation. Specifically, it comprises 600 prompts organized hierarchically to
ensure both coverage and efficiency: (1) spans across diverse real-world
scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively
probes T2I models' semantic consistency over 10 primary and 27 sub evaluation
criteria, with each prompt assessing multiple testpoints. To rigorously assess
model robustness to variations in language and prompt length, we provide both
English and Chinese versions of each prompt in short and long forms. Leveraging
the general world knowledge and fine-grained image understanding capabilities
of a closed-source Multi-modal Large Language Model (MLLM), i.e.,
Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark
construction and streamlined model assessment. Moreover, to further facilitate
community use, we train a robust evaluation model that enables offline
assessment of T2I model outputs. Through comprehensive benchmarking of both
open- and closed-sourced T2I models, we systematically reveal their strengths
and weaknesses across various aspects.
comment: Project page: codegoat24.github.io/UniGenBench/
☆ MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by
the quadratic scaling of full attention with sequence length. Since attention
is highly redundant, outputs are dominated by a small subset of query-key
pairs. Existing sparse methods rely on blockwise coarse estimation, whose
accuracy-efficiency trade-offs are constrained by block size. This paper
introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention
that uses a lightweight, learnable token router to precisely match tokens
without blockwise estimation. Through semantic-aware routing, MoGA enables
effective long-range interactions. As a kernel-free method, MoGA integrates
seamlessly with modern attention stacks, including FlashAttention and sequence
parallelism. Building on MoGA, we develop an efficient long video generation
model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps,
with a context length of approximately 580k. Comprehensive experiments on
various video generation tasks validate the effectiveness of our approach.
comment: 15 pages, 12 figures
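A toy sketch of the mixture-of-groups idea in the MoGA abstract: a lightweight learnable router assigns each token to a group, and attention is computed only inside each group. The hard routing, single-head setup, and group count are simplifications for illustration; the kernel-level details (FlashAttention, sequence parallelism) are omitted.

```python
# Sketch of mixture-of-groups attention with a learnable token router; illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfGroupsAttention(nn.Module):
    def __init__(self, dim=64, num_groups=4):
        super().__init__()
        self.router = nn.Linear(dim, num_groups)     # lightweight, learnable token router
        self.qkv = nn.Linear(dim, 3 * dim)
        self.num_groups = num_groups

    def forward(self, x):                            # x: [seq, dim] (single head, batch of 1)
        group_id = self.router(x).argmax(-1)         # hard assignment, for the sketch only
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        out = torch.zeros_like(x)
        for g in range(self.num_groups):
            idx = (group_id == g).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            scores = (q[idx] @ k[idx].t()) / q.shape[-1] ** 0.5
            out[idx] = F.softmax(scores, dim=-1) @ v[idx]   # attention restricted to the group
        return out

attn = MixtureOfGroupsAttention()
tokens = torch.randn(580, 64)          # a tiny stand-in for a ~580k-token video sequence
print(attn(tokens).shape)              # torch.Size([580, 64])
```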
☆ Beyond the Pipeline: Analyzing Key Factors in End-to-End Deep Learning for Historical Writer Identification
This paper investigates various factors that influence the performance of
end-to-end deep learning approaches for historical writer identification (HWI),
a task that remains challenging due to the diversity of handwriting styles,
document degradation, and the limited number of labelled samples per writer.
These conditions often make accurate recognition difficult, even for human
experts. Traditional HWI methods typically rely on handcrafted image processing
and clustering techniques, which tend to perform well on small and carefully
curated datasets. In contrast, end-to-end pipelines aim to automate the process
by learning features directly from document images. However, our experiments
show that many of these models struggle to generalise in more realistic,
document-level settings, especially under zero-shot scenarios where writers in
the test set are not present in the training data. We explore different
combinations of pre-processing methods, backbone architectures, and
post-processing strategies, including text segmentation, patch sampling, and
feature aggregation. The results suggest that most configurations perform
poorly due to weak capture of low-level visual features, inconsistent patch
representations, and high sensitivity to content noise. Still, we identify one
end-to-end setup that achieves results comparable to the top-performing system,
despite using a simpler design. These findings point to key challenges in
building robust end-to-end systems and offer insight into design choices that
improve performance in historical document writer identification.
comment: Published in The 12th IEEE International Conference on Data Science
and Advanced Analytics (DSAA), 2025
☆ Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches
Mustafa Fuad Rifet Ibrahim, Tunc Alkanat, Maurice Meijer, Felix Manthey, Alexander Schlaefer, Peer Stelldinger
The vast majority of cardiovascular diseases may be preventable if early
signs and risk factors are detected. Cardiovascular monitoring with body-worn
sensor devices like sensor patches allows for the detection of such signs while
preserving the freedom and comfort of patients. However, the analysis of the
sensor data must be robust, reliable, efficient, and highly accurate. Deep
learning methods can automate data interpretation, reducing the workload of
clinicians. In this work, we analyze the feasibility of applying deep learning
models to the classification of synchronized electrocardiogram (ECG) and
phonocardiogram (PCG) recordings on resource-constrained medical edge devices.
We propose a convolutional neural network with early fusion of data to solve a
binary classification problem. We train and validate our model on the
synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset.
Our approach reduces memory footprint and compute cost by three orders of
magnitude compared to the state-of-the-art while maintaining competitive
accuracy. We demonstrate the applicability of our proposed model on medical
edge devices by analyzing energy consumption on a microcontroller and an
experimental sensor device setup, confirming that on-device inference can be
more energy-efficient than continuous data streaming.
comment: Submitted to the IEEE Journal of Biomedical And Health Informatics
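A minimal sketch of an early-fusion 1D CNN for synchronized ECG and PCG windows, in the spirit of the model described above; the channel counts, kernel sizes, and input length are arbitrary illustrative choices, not the paper's architecture or its reported parameter budget.

```python
# Sketch of an early-fusion tiny CNN for 2-channel (ECG + PCG) binary classification.
import torch
import torch.nn as nn

class TinyFusionCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(              # ECG and PCG fused at the input (2 channels)
            nn.Conv1d(2, 8, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                           # x: [batch, 2, samples]
        return self.classifier(self.features(x).squeeze(-1))

model = TinyFusionCNN()
ecg_pcg = torch.randn(4, 2, 2000)                   # 4 synchronized windows, 2 channels each
logits = model(ecg_pcg)
print(logits.shape, sum(p.numel() for p in model.parameters()))   # small parameter count
```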
☆ Image augmentation with invertible networks in interactive satellite image change detection
This paper devises a novel interactive satellite image change detection
algorithm based on active learning. Our framework employs an iterative process
that leverages a question-and-answer model. This model queries the oracle
(user) about the labels of a small subset of images (dubbed as display), and
based on the oracle's responses, the change detection model is dynamically updated.
The main contribution of our framework resides in a novel invertible network
that allows augmenting displays, by mapping them from highly nonlinear input
spaces to latent ones, where augmentation transformations become linear and
more tractable. The resulting augmented data are afterwards mapped back to the
input space, and used to retrain more effective change detection criteria in
the subsequent iterations of active learning. Experimental results demonstrate
superior performance of our proposed method compared to the related work.
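A sketch of the augmentation idea in the abstract above: an invertible (coupling-layer) network maps display samples to a latent space, simple linear augmentations are applied there, and the exact inverse brings the augmented samples back to the input space. The architecture and the specific augmentations (interpolation, jitter) are assumptions for illustration, not the paper's network.

```python
# Sketch of latent-space augmentation through an invertible coupling layer; illustrative only.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """y1 = x1, y2 = x2 + t(x1): exactly invertible by construction."""
    def __init__(self, dim):
        super().__init__()
        self.t = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(), nn.Linear(64, dim - dim // 2))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 + self.t(x1)], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        return torch.cat([y1, y2 - self.t(y1)], dim=-1)

def augment_display(net, display, noise=0.05):
    z = net(display)                                  # map display samples to the latent space
    mix = 0.5 * (z + z.roll(1, dims=0))               # linear interpolation between samples
    jit = z + noise * torch.randn_like(z)             # small linear perturbation
    return net.inverse(torch.cat([mix, jit], dim=0))  # map augmented latents back to input space

net = AdditiveCoupling(dim=128)
display = torch.randn(16, 128)                        # features of the images shown to the oracle
augmented = augment_display(net, display)
print(augmented.shape)                                # torch.Size([32, 128])
```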
☆ Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression NeurIPS 2025
This paper proposes a novel matrix quantization method, Binary Quadratic
Quantization (BQQ). In contrast to conventional first-order quantization
approaches, such as uniform quantization and binary coding quantization, that
approximate real-valued matrices via linear combinations of binary bases, BQQ
leverages the expressive power of binary quadratic expressions while
maintaining an extremely compact data format. We validate our approach with two
experiments: a matrix compression benchmark and post-training quantization
(PTQ) on pretrained Vision Transformer-based models. Experimental results
demonstrate that BQQ consistently achieves a superior trade-off between memory
efficiency and reconstruction error than conventional methods for compressing
diverse matrix data. It also delivers strong PTQ performance, even though we
neither target state-of-the-art PTQ accuracy under tight memory constraints nor
rely on PTQ-specific binary matrix optimization. For example, our proposed
method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on
the ImageNet dataset under the calibration-based and data-free scenarios,
respectively, with quantization equivalent to 2 bits. These findings highlight
the surprising effectiveness of binary quadratic expressions for efficient
matrix approximation and neural network compression.
comment: Accepted to NeurIPS 2025
☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological
samples remains a challenge in the life sciences. EM data captures details of
biological structures, sometimes with such complexity that even human observers
can find it overwhelming. We introduce ε-Seg, a method based on
hierarchical variational autoencoders (HVAEs), employing center-region masking,
sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior,
and clustering-free label prediction. Center-region masking and the inpainting
loss encourage the model to learn robust and representative embeddings to
distinguish the desired classes, even if training labels are sparse (0.05% of
the total image data or less). For optimal performance, we employ CL and a GMM
prior to shape the latent space of the HVAE such that encoded input patches
tend to cluster with respect to the semantic classes we wish to distinguish. Finally,
instead of clustering latent embeddings for semantic segmentation, we propose an
MLP semantic segmentation head to directly predict class labels from latent
embeddings. We show empirical results of ε-Seg and baseline methods on
2 dense EM datasets of biological tissues and demonstrate the applicability of
our method also on fluorescence microscopy data. Our results show that
ε-Seg is capable of achieving competitive sparsely-supervised
segmentation results on complex biological image data, even if only limited
amounts of training labels are available.
comment: 10 pages main text, 17 pages total
☆ C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression BMVC2025
Neural network compression has gained increasing attention in recent years,
particularly in computer vision applications, where the need for model
reduction is crucial for overcoming deployment constraints. Pruning is a widely
used technique that promotes sparsity in model structures, e.g., weights,
neurons, and layers, reducing size and inference costs. Structured pruning is
especially important as it allows for the removal of entire structures, which
further accelerates inference time and reduces memory overhead. However, it can
be computationally expensive, requiring iterative retraining and optimization.
To overcome this problem, recent methods consider a one-shot setting, which
applies pruning directly at post-training. Unfortunately, they often lead to a
considerable drop in performance. In this paper, we focus on this issue by
proposing a novel one-shot pruning framework that relies on explainable deep
learning. First, we introduce a causal-aware pruning approach that leverages
cause-effect relations between model predictions and structures in a
progressive pruning process. It allows us to efficiently reduce the size of the
network, ensuring that the removed structures do not deter the performance of
the model. Then, through experiments conducted on convolutional neural network
and vision transformer baselines, pre-trained on classification tasks, we
demonstrate that our method consistently achieves substantial reductions in
model size, with minimal impact on performance, and without the need for
fine-tuning. Overall, our approach outperforms its counterparts, offering the
best trade-off. Our code is available on GitHub.
comment: 10 pages, BMVC2025
☆ Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
Though recent advances in vision-language models (VLMs) have achieved
remarkable progress across a wide range of multimodal tasks, understanding 3D
spatial relationships from limited views remains a significant challenge.
Previous reasoning methods typically rely on pure text (e.g., topological
cognitive maps) or on 2D visual cues. However, their limited representational
capacity hinders performance in specific tasks that require 3D spatial
imagination. To address this limitation, we propose 3DThinker, a framework that
can effectively exploit the rich geometric information embedded within images
while reasoning, like humans do. Our framework is the first to enable 3D
mentaling during reasoning without any 3D prior input, and it does not rely on
explicitly labeled 3D data for training. Specifically, our training consists of
two stages. First, we perform supervised training to align the 3D latent
generated by VLM while reasoning with that of a 3D foundation model (e.g.,
VGGT). Then, we optimize the entire reasoning trajectory solely based on
outcome signals, thereby refining the underlying 3D mentaling. Extensive
experiments across multiple benchmarks show that 3DThinker consistently
outperforms strong baselines and offers a new perspective toward unifying 3D
representations into multimodal reasoning. Our code will be available at
https://github.com/zhangquanchen/3DThinker.
comment: 12 pages, 4 figures
☆ CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, Xing Sun
Computer-using agents (CUAs) enable task completion through natural
interaction with operating systems and software interfaces. While script-based
verifiers are widely adopted for evaluation, they suffer from limited
scalability and inability to provide step-wise assessment. Reward models offer
promising alternatives, but their effectiveness on CUA evaluation remains
largely underexplored. To address this gap, we present CUARewardBench,
comprising four key contributions: (1) First-ever Comprehensive CUA Reward
Benchmark: We introduce the first benchmark for evaluating both outcome reward
models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic
assessment across trajectory-level and step-level evaluation. (2) Diverse,
Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10
software categories and 7 agent architectures with varying performance levels
(25.9%-50.8% success rates). All trajectories are expertly annotated through
carefully designed protocols, with rigorous quality control to ensure
reliability and practical applicability. (3) Comprehensive Analysis and
Insights: Through extensive experiments across 7 vision-language models and 3
prompt templates, we reveal critical limitations of current CUA RMs, including
insufficient visual reasoning capabilities, knowledge deficiencies, and the
superiority of general VLMs over specialized CUA models for reward evaluation.
(4) Unanimous Prompt Ensemble (UPE): Based on the insights from our
comprehensive analysis, we propose UPE, a novel ensemble method that
significantly enhances reward model reliability through strict unanimous voting
and strategic prompt-template configurations. UPE achieves 89.8% precision and
93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially
outperforming single VLMs and traditional ensemble approaches.
comment: 24 pages, 6 figures
☆ CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder NeurIPS 2025
Multimodal dataset distillation aims to synthesize a small set of image-text
pairs that enables efficient training of large-scale vision-language models.
While dataset distillation has shown promise in unimodal tasks, extending it to
multimodal contrastive learning presents key challenges: learning cross-modal
alignment and managing the high computational cost of large encoders. Prior
approaches address scalability by freezing the text encoder and updating only the
image encoder and text projection layer. However, we find this severely limits
semantic alignment and becomes a bottleneck for performance scaling. We propose
CovMatch, a scalable dataset distillation framework that aligns the
cross-covariance of real and synthetic features while regularizing feature
distributions within each modality. Unlike prior approaches, CovMatch enables
joint optimization of both encoders, leading to stronger cross-modal alignment
and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms
state-of-the-art multimodal distillation methods and achieves up to 6.8%
absolute gains in retrieval accuracy using only 500 synthetic pairs.
comment: NeurIPS 2025
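A sketch of the cross-covariance matching objective at the heart of the CovMatch abstract: the image-text cross-covariance of synthetic features is pulled toward that of real features, with a simple within-modality statistic as a regularizer. The feature extractors, dimensions, and loss weight are placeholders, not the paper's exact formulation.

```python
# Sketch of a cross-covariance alignment loss for multimodal dataset distillation.
import torch

def cross_cov(img_feat, txt_feat):
    """img_feat: [N, D_i], txt_feat: [N, D_t] -> [D_i, D_t] cross-covariance."""
    img_c = img_feat - img_feat.mean(0)
    txt_c = txt_feat - txt_feat.mean(0)
    return img_c.t() @ txt_c / (img_feat.shape[0] - 1)

def covmatch_loss(real_img, real_txt, syn_img, syn_txt, lam=0.1):
    align = (cross_cov(real_img, real_txt) - cross_cov(syn_img, syn_txt)).pow(2).mean()
    reg = (syn_img.var(0) - real_img.var(0)).pow(2).mean() + \
          (syn_txt.var(0) - real_txt.var(0)).pow(2).mean()      # within-modality statistics
    return align + lam * reg

real_img, real_txt = torch.randn(1000, 256), torch.randn(1000, 256)
syn_img = torch.randn(500, 256, requires_grad=True)             # e.g. 500 synthetic pairs
syn_txt = torch.randn(500, 256, requires_grad=True)
covmatch_loss(real_img, real_txt, syn_img, syn_txt).backward()
```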
☆ Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang, Tiankun Cao, Cheng Wang, Xiaotao Gu, Jie Tang, Dan Guo, Meng Wang
We present Kaleido, a subject-to-video (S2V) generation framework, which aims
to synthesize subject-consistent videos conditioned on multiple reference
images of target subjects. Despite recent progress in S2V generation models,
existing approaches remain inadequate at maintaining multi-subject consistency
and at handling background disentanglement, often resulting in lower reference
fidelity and semantic drift under multi-image conditioning. These shortcomings
can be attributed to several factors. Primarily, the training dataset suffers
from a lack of diversity and high-quality samples, as well as cross-paired
data, i.e., paired samples whose components originate from different instances.
In addition, the current mechanism for integrating multiple reference images is
suboptimal, potentially resulting in the confusion of multiple subjects. To
overcome these limitations, we propose a dedicated data construction pipeline,
incorporating low-quality sample filtering and diverse data synthesis, to
produce consistency-preserving training data. Moreover, we introduce Reference
Rotary Positional Encoding (R-RoPE) to process reference images, enabling
stable and precise multi-image integration. Extensive experiments across
numerous benchmarks demonstrate that Kaleido significantly outperforms previous
methods in consistency, fidelity, and generalization, marking an advance in S2V
generation.
comment: 11 pages, 6 figures
☆ Descriptor: Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving
Sanjay Kumar, Tim Brophy, Reenu Mohandas, Eoin Martino Grua, Ganesh Sistu, Valentina Donzella, Ciaran Eising
Robust perception in automated driving requires reliable performance under
adverse conditions, where sensors may be affected by partial failures or
environmental occlusions. Although existing autonomous driving datasets
inherently contain sensor noise and environmental variability, very few enable
controlled, parameterised, and reproducible degradations across multiple
sensing modalities. This gap limits the ability to systematically evaluate how
perception and fusion architectures perform under well-defined adverse
conditions. To address this limitation, we introduce the Occluded nuScenes
Dataset, a novel extension of the widely used nuScenes benchmark. For the
camera modality, we release both the full and mini versions with four types of
occlusions, two adapted from public implementations and two newly designed. For
radar and LiDAR, we provide parameterised occlusion scripts that implement
three types of degradations each, enabling flexible and repeatable generation
of occluded data. This resource supports consistent, reproducible evaluation of
perception models under partial sensor failures and environmental interference.
By releasing the first multi-sensor occlusion dataset with controlled and
reproducible degradations, we aim to advance research on robust sensor fusion,
resilience analysis, and safety-critical perception in automated driving.
☆ GBlobs: Local LiDAR Geometry for Improved Sensor Placement Generalization
Dušan Malić, Christian Fruhwirth-Reisinger, Alexander Prutsch, Wei Lin, Samuel Schulter, Horst Possegger
This technical report outlines the top-ranking solution for RoboSense 2025:
Track 3, achieving state-of-the-art performance on 3D object detection under
various sensor placements. Our submission utilizes GBlobs, a local point cloud
feature descriptor specifically designed to enhance model generalization across
diverse LiDAR configurations. Current LiDAR-based 3D detectors often suffer
from a "geometric shortcut" when trained on conventional global
features (i.e., absolute Cartesian coordinates). This introduces a position bias
that causes models to primarily rely on absolute object position rather than
distinguishing shape and appearance characteristics. Although effective for
in-domain data, this shortcut severely limits generalization when encountering
different point distributions, such as those resulting from varying sensor
placements. By using GBlobs as network input features, we effectively
circumvent this geometric shortcut, compelling the network to learn robust,
object-centric representations. This approach significantly enhances the
model's ability to generalize, resulting in the exceptional performance
demonstrated in this challenge.
comment: 1st place at the IROS'25 RoboSense Challenge, Track #3: Cross-Sensor
Placement 3D Object Detection
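A sketch of the motivation only, not the actual GBlobs descriptor: replacing absolute ("global") point coordinates with neighborhood-relative offsets removes the absolute-position signal that enables the geometric shortcut. The k-nearest-neighbor construction below is a generic local-feature stand-in.

```python
# Generic local-feature sketch illustrating why local geometry avoids the position bias.
import torch

def local_offsets(points, k=16):
    """points: [N, 3] absolute LiDAR coordinates -> [N, k, 3] offsets to the k nearest
    neighbors, which do not depend on where the object sits in the scene."""
    dists = torch.cdist(points, points)                        # [N, N] pairwise distances
    knn_idx = dists.topk(k + 1, largest=False).indices[:, 1:]  # drop the point itself
    neighbors = points[knn_idx]                                # [N, k, 3]
    return neighbors - points.unsqueeze(1)                     # translation-invariant offsets

points = torch.randn(2048, 3) * 20.0                           # a toy LiDAR sweep
feats = local_offsets(points)
# shifting the whole cloud leaves the local features (essentially) unchanged:
print(torch.allclose(feats, local_offsets(points + 5.0), atol=1e-5))   # expected: True
```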
☆ RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
Typical template-based object pose pipelines estimate the pose by retrieving
the closest matching template and aligning it with the observed image. However,
failure to retrieve the correct template often leads to inaccurate pose
predictions. To address this, we reformulate template-based object pose
estimation as a ray alignment problem, where the viewing directions from
multiple posed template images are learned to align with a non-posed query
image. Inspired by recent progress in diffusion-based camera pose estimation,
we embed this formulation into a diffusion transformer architecture that aligns
a query image with a set of posed templates. We reparameterize object rotation
using object-centered camera rays and model object translation by extending
scale-invariant translation estimation to dense translation offsets. Our model
leverages geometric priors from the templates to guide accurate query pose
inference. A coarse-to-fine training strategy based on narrowed template
sampling improves performance without modifying the network architecture.
Extensive experiments across multiple benchmark datasets show competitive
results of our method compared to state-of-the-art approaches in unseen object
pose estimation.
☆ DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
The rise of convenience packaging has led to the generation of enormous waste,
making efficient waste sorting crucial for sustainable waste management. To
address this, we developed DWaste, a computer vision-powered platform designed
for real-time waste sorting on resource-constrained smartphones and edge
devices, including offline functionality. We benchmarked various image
classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object
detection models (YOLOv8n, YOLOv11n) on a subset of our own waste dataset,
annotated using the custom tool Annotated Lab. We found a clear trade-off
between accuracy and resource consumption: the best classifier,
EfficientNetV2S, achieved high accuracy (~ 96%) but suffered from high latency
(~ 0.22s) and elevated carbon emissions. In contrast, lightweight object
detection models delivered strong performance (up to 77% mAP) with ultra-fast
inference (~ 0.03s) and significantly smaller model sizes (< 7MB), making them
ideal for real-time, low-power use. Model quantization further maximized
efficiency, substantially reducing model size and VRAM usage by up to 75%. Our
work demonstrates the successful implementation of "Greener AI" models to
support real-time, sustainable waste sorting on edge devices.
comment: 8 pages, 8 figures
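The reported reduction of up to 75% from quantization is consistent with storing 32-bit weights in 8 bits. A minimal, library-agnostic sketch of per-tensor affine int8 quantization (not the authors' pipeline) makes the arithmetic concrete:

```python
import numpy as np

def quantize_int8(w):
    """Per-tensor affine quantization of float32 weights to int8."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    zero_point = -128 - round(w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(1_000_000).astype(np.float32)  # stand-in for a model's weights
q, scale, zp = quantize_int8(weights)
saving = 1.0 - q.nbytes / weights.nbytes                  # 0.75 for float32 -> int8
print(f"int8 storage is {saving:.0%} smaller; "
      f"max abs error {np.abs(dequantize(q, scale, zp) - weights).max():.4f}")
```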
☆ Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Vehicle make and model recognition (VMMR) is an important task in intelligent
transportation systems, but existing approaches struggle to adapt to newly
released models. Contrastive Language-Image Pretraining (CLIP) provides strong
visual-text alignment, yet its fixed pretrained weights limit performance
without costly image-specific finetuning. We propose a pipeline that integrates
vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to
support zero-shot recognition through text-based reasoning. A VLM converts
vehicle images into descriptive attributes, which are compared against a
database of textual features. Relevant entries are retrieved and combined with
the description to form a prompt, and a language model (LM) infers the make and
model. This design avoids large-scale retraining and enables rapid updates by
adding textual descriptions of new vehicles. Experiments show that the proposed
method improves recognition by nearly 20% over the CLIP baseline, demonstrating
the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city
applications.
comment: Accepted by The 38th Conference of Open Innovations Association
FRUCT, 2025
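A minimal sketch of the retrieve-then-prompt idea, using random vectors in place of real text embeddings and omitting the VLM and LM calls; the database contents and names are illustrative, not the authors' implementation:

```python
import numpy as np

# Toy database of textual vehicle descriptions. In the described pipeline the
# embeddings would come from a text encoder and the query description from a VLM;
# random vectors stand in here purely to show the retrieval flow.
db_texts = [
    "Sedan, narrow LED headlights, hexagonal grille",
    "Compact SUV, split headlights, roof rails",
    "Pickup truck, rectangular grille, high ground clearance",
]
rng = np.random.default_rng(0)
db_vecs = rng.normal(size=(len(db_texts), 384))
query_vec = db_vecs[1] + 0.1 * rng.normal(size=384)  # stand-in for the embedded VLM description

def cosine_top_k(q, db, k=2):
    """Indices of the k database entries most similar to the query."""
    q = q / np.linalg.norm(q)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]

retrieved = [db_texts[i] for i in cosine_top_k(query_vec, db_vecs)]
prompt = ("Vehicle description: <VLM output here>\n"
          "Reference entries:\n- " + "\n- ".join(retrieved) +
          "\nState the most likely make and model.")
print(prompt)
```

Adding a new vehicle then only requires appending its textual description and embedding to the database, which is why no retraining is needed.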
☆ Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D
high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR)
videos captured with alternating exposures. To tackle such a challenging
problem, we present a unified framework with a two-stage optimization approach
based on Gaussian Splatting. The first stage learns a video HDR Gaussian
representation in orthographic camera coordinate space, eliminating the need
for camera poses and enabling robust initial HDR video reconstruction. The
second stage transforms video Gaussians into world space and jointly refines
the world Gaussians with camera poses. Furthermore, we propose a temporal
luminance regularization strategy to enhance the temporal consistency of the
HDR appearance. Since our task has not been studied before, we construct a new
evaluation benchmark using publicly available datasets for HDR video
reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR
significantly outperforms alternative solutions adapted from state-of-the-art
methods in both rendering quality and speed.
comment: Project page is available at
https://liujf1226.github.io/Mono4DGS-HDR/
☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on
the quality of their visual tokenizer. While recent works have explored
incorporating Vision Foundation Models (VFMs) via distillation, we identify a
fundamental flaw in this approach: it inevitably weakens the robustness of
alignment with the original VFM, causing the aligned latents to deviate
semantically under distribution shifts. In this paper, we bypass distillation
by proposing a more direct approach: Vision Foundation Model Variational
Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's
semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE
decoder with Multi-Scale Latent Fusion and Progressive Resolution
Reconstruction blocks, enabling high-quality reconstruction from spatially
coarse VFM features. Furthermore, we provide a comprehensive analysis of
representation dynamics during diffusion training, introducing the proposed
SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows
us to develop a joint tokenizer-diffusion alignment strategy that dramatically
accelerates convergence. Our innovations in tokenizer design and training
strategy lead to superior performance and efficiency: our system reaches a gFID
(w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers).
With continued training to 640 epochs, it further attains a gFID (w/o CFG) of
1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: Code and models available at: https://github.com/tianciB/VFM-VAE
☆ LAND: Lung and Nodule Diffusion for 3D Chest CT Synthesis with Anatomical Guidance
Anna Oliveras, Roger Marí, Rafael Redondo, Oriol Guardià, Ana Tost, Bhalaji Nagarajan, Carolina Migliorelli, Vicent Ribas, Petia Radeva
This work introduces a new latent diffusion model to generate high-quality 3D
chest CT scans conditioned on 3D anatomical masks. The method synthesizes
volumetric images of size 256x256x256 at 1 mm isotropic resolution using a
single mid-range GPU, significantly lowering the computational cost compared to
existing approaches. The conditioning masks delineate lung and nodule regions,
enabling precise control over the output anatomical features. Experimental
results demonstrate that conditioning solely on nodule masks leads to
anatomically incorrect outputs, highlighting the importance of incorporating
global lung structure for accurate conditional synthesis. The proposed approach
supports the generation of diverse CT volumes with and without lung nodules of
varying attributes, providing a valuable tool for training AI models or
healthcare professionals.
☆ Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection ICCV 2025
At the core of Camouflaged Object Detection (COD) lies segmenting objects
from their highly similar surroundings. Previous efforts navigate this
challenge primarily through image-level modeling or annotation-based
optimization. Despite advancing considerably, this commonplace practice hardly
taps valuable dataset-level contextual information or relies on laborious
annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented
paradigm that exploits the entire training dataset to generate pseudo-labels
for single images, which could be used to train COD models. RISE begins by
constructing prototype libraries for environments and camouflaged objects using
training images (without ground truth), followed by K-Nearest Neighbor (KNN)
retrieval to generate pseudo-masks for each image based on these libraries. It
is important to recognize that using only training images without annotations
poses a pronounced challenge in crafting high-quality prototype libraries. In
this light, we introduce a Clustering-then-Retrieval (CR) strategy, where
coarse masks are first generated through clustering, facilitating subsequent
histogram-based image filtering and cross-category retrieval to produce
high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect
of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which
integrates retrieval results from diverse views to produce more robust and
precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms
state-of-the-art unsupervised and prompt-based methods. Code is available at
https://github.com/xiaohainku/RISE.
comment: ICCV 2025
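A minimal sketch of the KNN-retrieval step, assuming per-pixel features and the two prototype libraries are already available (random arrays stand in for them); this shows the general mechanism, not the released code:

```python
import numpy as np

def knn_pseudo_mask(pixel_feats, fg_protos, bg_protos, k=5):
    """Label a pixel foreground if most of its k nearest prototypes come from the
    camouflaged-object library rather than the environment library.

    pixel_feats: (H*W, C); fg_protos: (Nf, C); bg_protos: (Nb, C).
    Returns a binary (H*W,) pseudo-mask.
    """
    protos = np.concatenate([fg_protos, bg_protos], axis=0)
    labels = np.concatenate([np.ones(len(fg_protos)), np.zeros(len(bg_protos))])
    # Cosine similarity between every pixel feature and every prototype.
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sim = f @ p.T                                    # (H*W, Nf+Nb)
    nn = np.argsort(-sim, axis=1)[:, :k]             # k nearest prototypes per pixel
    return (labels[nn].mean(axis=1) > 0.5).astype(np.uint8)

H, W, C = 32, 32, 64
rng = np.random.default_rng(0)
mask = knn_pseudo_mask(rng.normal(size=(H * W, C)),
                       fg_protos=rng.normal(size=(50, C)),
                       bg_protos=rng.normal(size=(80, C))).reshape(H, W)
```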
☆ ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
We introduce ImageGem, a dataset for studying generative models that
understand fine-grained individual preferences. We posit that a key challenge
hindering the development of such a generative model is the lack of in-the-wild
and fine-grained user preference annotations. Our dataset features real-world
interaction data from 57K users, who collectively have built 242K customized
LoRAs, written 3M text prompts, and created 5M generated images. With user
preference annotations from our dataset, we were able to train better
preference alignment models. In addition, leveraging individual user
preference, we investigated the performance of retrieval models and a
vision-language model on personalized image retrieval and generative model
recommendation. Finally, we propose an end-to-end framework for editing
customized diffusion models in a latent weight space to align with individual
user preferences. Our results demonstrate that the ImageGem dataset enables,
for the first time, a new paradigm for generative model personalization.
☆ ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters
Recent advancements in vision transformers (ViTs) have demonstrated that
larger models often achieve superior performance. However, training these
models remains computationally intensive and costly. To address this challenge,
we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike
conventional training from scratch, ScaleNet facilitates rapid model expansion
with negligible increases in parameters, building on existing pretrained
models. This offers a cost-effective solution for scaling up ViTs.
Specifically, ScaleNet achieves model expansion by inserting additional layers
into pretrained ViTs, utilizing layer-wise weight sharing to maintain
parameter efficiency. Each added layer shares its parameter tensor with a
corresponding layer from the pretrained model. To mitigate potential
performance degradation due to shared weights, ScaleNet introduces a small set
of adjustment parameters for each layer. These adjustment parameters are
implemented through parallel adapter modules, ensuring that each instance of
the shared parameter tensor remains distinct and optimized for its specific
function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet
enables efficient expansion of ViT models. With a 2$\times$ depth-scaled
DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training
from scratch while requiring only one-third of the training epochs,
highlighting its efficiency in scaling ViTs. Beyond image classification, our
method shows significant potential for application in downstream vision areas,
as evidenced by validation on an object detection task.
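A minimal PyTorch sketch of the weight-sharing-plus-parallel-adapter idea on a generic transformer block; the module names and adapter design are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Small bottleneck adapter added in parallel to a shared block."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity-preserving branch
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

class SharedBlockWithAdapter(nn.Module):
    """Reuses a pretrained block's parameter tensors and adds a tiny per-instance
    adapter, so an inserted layer costs only the adapter's parameters."""
    def __init__(self, shared_block, dim):
        super().__init__()
        self.shared_block = shared_block  # same module object as the pretrained layer
        self.adapter = ParallelAdapter(dim)

    def forward(self, x):
        return self.shared_block(x) + self.adapter(x)

# Toy usage: "expand" a 2-layer stack to 4 layers by inserting weight-shared copies.
dim = 192
pretrained = nn.ModuleList([nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                            for _ in range(2)])
expanded = nn.ModuleList()
for block in pretrained:
    expanded.append(block)                               # original layer
    expanded.append(SharedBlockWithAdapter(block, dim))  # inserted layer sharing its weights

x = torch.randn(1, 16, dim)
for layer in expanded:
    x = layer(x)
```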
☆ Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling
This paper presents an automated system for cricket video analysis that
leverages deep learning techniques to extract wicket-taking deliveries, detect
cricket balls, and model ball trajectories. The system employs the YOLOv8
architecture for pitch and ball detection, combined with optical character
recognition (OCR) for scorecard extraction to identify wicket-taking moments.
Through comprehensive image preprocessing, including grayscale transformation,
power transformation, and morphological operations, the system achieves robust
text extraction from video frames. The pitch detection model achieved 99.5%
mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the
ball detection model using transfer learning attained 99.18% mAP50 with 0.968
precision and 0.978 recall. The system enables trajectory modeling on detected
pitches, providing data-driven insights for identifying batting weaknesses.
Experimental results on multiple cricket match videos demonstrate the
effectiveness of this approach for automated cricket analytics, offering
significant potential for coaching and strategic decision-making.
comment: 6 figures, 5 tables, submitted to the 11th IEEE International Women
in Engineering (WIE) Conference on Electrical and Computer Engineering 2025
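A short OpenCV sketch of the preprocessing steps named in the abstract (grayscale conversion, power/gamma transform, morphological opening); the specific gamma value and kernel size are assumed, not taken from the paper:

```python
import cv2
import numpy as np

def preprocess_scorecard(frame_bgr, gamma=1.5, kernel_size=3):
    """Prepare a cropped scorecard region of a video frame for OCR."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Power (gamma) transform to stretch contrast before binarization.
    powered = np.power(gray.astype(np.float32) / 255.0, gamma)
    powered = (powered * 255).astype(np.uint8)
    # Otsu thresholding, then a morphological opening to remove small speckles.
    _, binary = cv2.threshold(powered, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

frame = (np.random.rand(72, 256, 3) * 255).astype(np.uint8)  # stand-in for a scorecard crop
ocr_ready = preprocess_scorecard(frame)
```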
☆ Bayesian Fully-Connected Tensor Network for Hyperspectral-Multispectral Image Fusion
Tensor decomposition is a powerful tool for data analysis and has been
extensively employed in the field of hyperspectral-multispectral image fusion
(HMF). Existing tensor decomposition-based fusion methods typically rely on
disruptive data vectorization/reshaping or impose rigid constraints on the
arrangement of factor tensors, hindering the preservation of spatial-spectral
structures and the modeling of cross-dimensional correlations. Although recent
advances utilizing the Fully-Connected Tensor Network (FCTN) decomposition have
partially alleviated these limitations, the process of reorganizing data into
higher-order tensors still disrupts the intrinsic spatial-spectral structure.
Furthermore, these methods necessitate extensive manual parameter tuning and
exhibit limited robustness against noise and spatial degradation. To alleviate
these issues, we propose the Bayesian FCTN (BFCTN) method. Within this
probabilistic framework, a hierarchical sparse prior that characterizes the
sparsity of physical elements and establishes connections between the factor
tensors. This framework explicitly models the intrinsic physical coupling among
spatial structures, spectral signatures, and local scene homogeneity. For model
learning, we develop a parameter estimation method based on Variational
Bayesian inference (VB) and the Expectation-Maximization (EM) algorithm, which
significantly reduces the need for manual parameter tuning. Extensive
experiments demonstrate that BFCTN not only achieves state-of-the-art fusion
accuracy and strong robustness but also exhibits practical applicability in
complex real-world scenarios.
☆ Entropy-Enhanced Conformal Features from Ricci Flow for Robust Alzheimer's Disease Classification
Background and Objective: In brain imaging, geometric surface models are
essential for analyzing the 3D shapes of anatomical structures. Alzheimer's
disease (AD) is associated with significant cortical atrophy, making such shape
analysis a valuable diagnostic tool. The objective of this study is to
introduce and validate a novel local surface representation method for the
automated and accurate diagnosis of AD. Methods: The study utilizes T1-weighted
MRI scans from 160 participants (80 AD patients and 80 healthy controls) from
the Alzheimer's Disease Neuroimaging Initiative (ADNI). Cortical surface models
were reconstructed from the MRI data using Freesurfer. Key geometric attributes
were computed from the 3D meshes. Area distortion and conformal factor were
derived using Ricci flow for conformal parameterization, while Gaussian
curvature was calculated directly from the mesh geometry. Shannon entropy was
applied to these three features to create compact and informative feature
vectors. The feature vectors were used to train and evaluate a suite of
classifiers (e.g. XGBoost, MLP, Logistic Regression, etc.). Results:
Statistical significance of performance differences between classifiers was
evaluated using paired Welch's t-test. The method proved highly effective in
distinguishing AD patients from healthy controls. The Multi-Layer Perceptron
(MLP) and Logistic Regression classifiers outperformed all others, achieving an
accuracy and F$_1$ Score of 98.62%. Conclusions: This study confirms that the
entropy of conformally-derived geometric features provides a powerful and
robust metric for cortical morphometry. The high classification accuracy
underscores the method's potential to enhance the study and diagnosis of
Alzheimer's disease, offering a straightforward yet powerful tool for clinical
research applications.
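A small sketch of turning a per-vertex geometric attribute into a single Shannon-entropy feature, as described above; the number of histogram bins is an assumption:

```python
import numpy as np

def shannon_entropy(values, bins=64):
    """Shannon entropy (in bits) of the histogram of a per-vertex attribute,
    e.g. conformal factor, area distortion, or Gaussian curvature."""
    hist, _ = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
conformal_factor = rng.normal(size=150_000)          # stand-ins for per-vertex values
area_distortion = rng.lognormal(size=150_000)
gaussian_curvature = rng.normal(scale=0.1, size=150_000)

# Compact 3-dimensional feature vector per subject, fed to a standard classifier.
feature_vector = [shannon_entropy(v) for v in
                  (conformal_factor, area_distortion, gaussian_curvature)]
```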
☆ S2AP: Score-space Sharpness Minimization for Adversarial Pruning
Adversarial pruning methods have emerged as a powerful tool for compressing
neural networks while preserving robustness against adversarial attacks. These
methods typically follow a three-step pipeline: (i) pretrain a robust model,
(ii) select a binary mask for weight pruning, and (iii) finetune the pruned
model. To select the binary mask, these methods minimize a robust loss by
assigning an importance score to each weight, and then keep the weights with
the highest scores. However, this score-space optimization can lead to sharp
local minima in the robust loss landscape and, in turn, to an unstable mask
selection, reducing the robustness of adversarial pruning methods. To overcome
this issue, we propose a novel plug-in method for adversarial pruning, termed
Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we
introduce the concept of score-space sharpness minimization, which operates
during the mask search by perturbing importance scores and minimizing the
corresponding robust loss. Extensive experiments across various datasets,
models, and sparsity levels demonstrate that S2AP effectively minimizes
sharpness in score space, stabilizing the mask selection, and ultimately
improving the robustness of adversarial pruning methods.
☆ Cross-Modal Scene Semantic Alignment for Image Complexity Assessment
Yuqing Luo, Yixiao Li, Jiang Liu, Jun Fu, Hadi Amirpour, Guanghui Yue, Baoquan Zhao, Padraig Corcoran, Hantao Liu, Wei Zhou
Image complexity assessment (ICA) is a challenging task in perceptual
evaluation due to the subjective nature of human perception and the inherent
semantic diversity in real-world images. Existing ICA methods predominantly
rely on hand-crafted or shallow convolutional neural network-based features of
a single visual modality, which are insufficient to fully capture the perceived
representations closely related to image complexity. Recently, cross-modal
scene semantic information has been shown to play a crucial role in various
computer vision tasks, particularly those involving perceptual understanding.
However, the exploration of cross-modal scene semantic information in the
context of ICA remains unaddressed. Therefore, in this paper, we propose a
novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which
leverages scene semantic alignment from a cross-modal perspective to enhance
ICA performance, enabling complexity predictions to be more consistent with
subjective human perception. Specifically, the proposed CM-SSA consists of a
complexity regression branch and a scene semantic alignment branch. The
complexity regression branch estimates image complexity levels under the
guidance of the scene semantic alignment branch, while the scene semantic
alignment branch is used to align images with corresponding text prompts that
convey rich scene semantic information by pair-wise learning. Extensive
experiments on several ICA datasets demonstrate that the proposed CM-SSA
significantly outperforms state-of-the-art approaches. Codes are available at
https://github.com/XQ2K/First-Cross-Model-ICA.
comment: 14 pages, 2 figures, British Machine Vision Conference
☆ FeatureFool: Zero-Query Fooling of Video Models via Feature Map
Duoxun Tang, Xi Xiao, Guangwu Hu, Kangkang Sun, Xiao Yang, Dongyang Chen, Qing Li, Yongjie Yin, Jiyao Wang
The vulnerability of deep neural networks (DNNs) has been preliminarily
verified. Existing black-box adversarial attacks usually require multi-round
interaction with the model and consume numerous queries, which is impractical
in the real-world and hard to scale to recently emerged Video-LLMs. Moreover,
no attack in the video domain directly leverages feature maps to shift the
clean-video feature space. We therefore propose FeatureFool, a stealthy,
video-domain, zero-query black-box attack that utilizes information extracted
from a DNN to alter the feature space of clean videos. Unlike query-based
methods that rely on iterative interaction, FeatureFool performs a zero-query
attack by directly exploiting DNN-extracted information. This efficient
approach is unprecedented in the video domain. Experiments show that
FeatureFool achieves an attack success rate above 70% against traditional
video classifiers without any queries. Benefiting from the transferability of
the feature map, it can also craft harmful content and bypass Video-LLM
recognition. Additionally, adversarial videos generated by FeatureFool exhibit
high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the
attack barely perceptible. This paper may contain violent or explicit content.
☆ Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
Uncertainty quantification (UQ) is essential for deploying deep neural
networks in safety-critical settings. Although methods like Deep Ensembles
achieve strong UQ performance, their high computational and memory costs hinder
scalability to large models. We introduce Hydra Ensembles, an efficient
transformer-based ensemble that prunes attention heads to create diverse
members and merges them via a new multi-head attention with grouped
fully-connected layers. This yields a compact model with inference speed close
to a single network, matching or surpassing Deep Ensembles in UQ performance
without retraining from scratch. We also provide an in-depth analysis of
pruning, showing that naive approaches can harm calibration, whereas Hydra
Ensembles preserves robust uncertainty. Experiments on image and text
classification tasks, with various architectures, show consistent gains over
Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our
approach surpasses state-of-the-art methods, even without requiring additional
training.
☆ Learning Human-Object Interaction as Groups
Human-Object Interaction Detection (HOI-DET) aims to localize human-object
pairs and identify their interactive relationships. To aggregate contextual
cues, existing methods typically propagate information across all detected
entities via self-attention mechanisms, or establish message passing between
humans and objects with bipartite graphs. However, they primarily focus on
pairwise relationships, overlooking that interactions in real-world scenarios
often emerge from collective behaviors (multiple humans and objects engaging in
joint activities). In light of this, we revisit relation modeling from a group
view and propose GroupHOI, a framework that propagates contextual information
in terms of geometric proximity and semantic similarity. To exploit the
geometric proximity, humans and objects are grouped into distinct clusters
using a learnable proximity estimator based on spatial features derived from
bounding boxes. In each group, a soft correspondence is computed via
self-attention to aggregate and dispatch contextual cues. To incorporate the
semantic similarity, we enhance the vanilla transformer-based interaction
decoder with local contextual cues from HO-pair features. Extensive experiments
on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over
the state-of-the-art methods. It also exhibits leading performance on the more
challenging Nonverbal Interaction Detection (NVI-DET) task, which involves
varied forms of higher-order interactions within groups.
☆ Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Direct preference optimization (DPO) methods have shown strong potential in
aligning text-to-image diffusion models with human preferences by training on
paired comparisons. These methods improve training stability by avoiding the
REINFORCE algorithm but still struggle with challenges such as accurately
estimating image probabilities due to the non-linear nature of the sigmoid
function and the limited diversity of offline datasets. In this paper, we
introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new
preference learning framework grounded in inverse reinforcement learning.
Diffusion-DRO removes the dependency on a reward model by casting preference
learning as a ranking problem, thereby simplifying the training objective into
a denoising formulation and overcoming the non-linear estimation issues found
in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert
demonstrations with online policy-generated negative samples, enabling it to
effectively capture human preferences while addressing the limitations of
offline data. Comprehensive experiments show that Diffusion-DRO delivers
improved generation quality across a range of challenging and unseen prompts,
outperforming state-of-the-art baselines in both quantitative metrics and
user studies. Our source code and pre-trained models are available at
https://github.com/basiclab/DiffusionDRO.
☆ AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
Audio-Visual Question Answering (AVQA) requires models to effectively utilize
both visual and auditory modalities to answer complex and diverse questions
about audio-visual scenes. However, existing methods lack sufficient
flexibility and dynamic adaptability in temporal sampling and modality
preference awareness, making it difficult to focus on key information based on
the question. This limits their reasoning capability in complex scenarios. To
address these challenges, we propose a novel framework named AV-Master. It
enhances the model's ability to extract key information from complex
audio-visual scenes with substantial redundant content by dynamically modeling
both temporal and modality dimensions. In the temporal dimension, we introduce
a dynamic adaptive focus sampling mechanism that progressively focuses on
audio-visual segments most relevant to the question, effectively mitigating
redundancy and segment fragmentation in traditional sampling methods. In the
modality dimension, we propose a preference-aware strategy that models each
modality's contribution independently, enabling selective activation of
critical features. Furthermore, we introduce a dual-path contrastive loss to
reinforce consistency and complementarity across temporal and modality
dimensions, guiding the model to learn question-specific cross-modal
collaborative representations. Experiments on four large-scale benchmarks show
that AV-Master significantly outperforms existing methods, especially in
complex reasoning tasks.
comment: 13 pages, 9 figures
☆ GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data
Compared to the prosperity of pre-training models in natural image
understanding, the research on large-scale pre-training models for facial
knowledge learning is still limited. Current approaches mainly rely on manually
assembled and annotated face datasets for training, but labeling such datasets
is labor-intensive and the trained models have limited scalability beyond the
training data. To address these limitations, we present a generative
pre-training model for facial knowledge learning that leverages large-scale
web-built data for training. We use texts and images containing human faces
crawled from the internet and conduct pre-training on self-supervised tasks,
including masked image/language modeling (MILM) and image-text matching (ITM).
During the generation stage, we further utilize the image-text matching loss to
pull the generation distribution towards the control signal for controllable
image/text generation. Experimental results demonstrate that our model achieves
comparable performance to state-of-the-art pre-training models for various
facial downstream tasks, such as attribute classification and expression
recognition. Furthermore, our approach is also applicable to a wide range of
face editing tasks, including face attribute editing, expression manipulation,
mask removal, and photo inpainting.
comment: This work was initially drafted in November 2022
☆ ViSE: A Systematic Approach to Vision-Only Street-View Extrapolation
Realistic view extrapolation is critical for closed-loop simulation in
autonomous driving, yet it remains a significant challenge for current Novel
View Synthesis (NVS) methods, which often produce distorted and inconsistent
images beyond the original trajectory. This report presents our winning
solution, which took first place in the RealADSim Workshop NVS track at ICCV
2025. To address the core challenges of street view extrapolation, we introduce
a comprehensive four-stage pipeline. First, we employ a data-driven
initialization strategy to generate a robust pseudo-LiDAR point cloud, avoiding
local minima. Second, we inject strong geometric priors by modeling the road
surface with a novel dimension-reduced SDF termed 2D-SDF. Third, we leverage a
generative prior to create pseudo ground truth for extrapolated viewpoints,
providing auxiliary supervision. Finally, a data-driven adaptation network
removes time-specific artifacts. On the RealADSim-NVS benchmark, our method
achieves a final score of 0.441, ranking first among all participants.
☆ Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net
The increasing frequency of natural and human-induced disasters necessitates
advanced visual recognition techniques capable of analyzing critical
photographic data. With progress in artificial intelligence and resilient
computational systems, rapid and accurate disaster classification has become
crucial for efficient rescue operations. However, visual recognition in
disaster contexts faces significant challenges due to limited and diverse data,
stemming from the difficulties in collecting and curating comprehensive, high-quality
disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data
scarcity, yet current FSL research mainly relies on generic benchmark datasets
lacking remote-sensing disaster imagery, limiting its practical effectiveness.
Moreover, disaster images exhibit high intra-class variation and inter-class
similarity, hindering the performance of conventional metric-based FSL methods.
To address these issues, this paper introduces the Attention-based
Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which
linearly combines the Bhattacharyya coefficient and Hellinger distances to
compare and aggregate feature probability distributions for robust prototype
formation. The Bhattacharyya coefficient serves as a contrastive margin that
enhances inter-class separability, while the Hellinger distance regularizes
same-class alignment. This framework parallels contrastive learning but
operates over probability distributions rather than embedded feature points.
Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is
proposed as a distributional counterpart to cosine similarity loss, used
jointly with categorical cross-entropy to significantly improve FSL
performance. Experiments on four FSL benchmarks and two disaster image datasets
demonstrate the superior effectiveness and generalization of ATTBHFA-Net
compared to existing approaches.
comment: Submitted to a SN journal
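For discrete distributions the two quantities named above have simple closed forms; a short sketch, assuming features are first normalized into probability distributions (the loss weighting used in the paper is not reproduced):

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i): 1 for identical distributions, 0 for disjoint support."""
    return float(np.sum(np.sqrt(p * q)))

def hellinger_distance(p, q):
    """H(p, q) = sqrt(1 - BC(p, q)), a metric bounded in [0, 1]."""
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q))))

def to_distribution(features):
    """Softmax-normalize a feature vector into a discrete probability distribution."""
    e = np.exp(features - features.max())
    return e / e.sum()

rng = np.random.default_rng(0)
p = to_distribution(rng.normal(size=128))  # e.g. a query feature
q = to_distribution(rng.normal(size=128))  # e.g. a class prototype
print(bhattacharyya_coefficient(p, q), hellinger_distance(p, q))
```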
☆ Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding
Large Vision-Language Models (LVLMs) have recently achieved impressive
results in multimodal tasks such as image captioning and visual question
answering. However, they remain prone to object hallucination -- generating
descriptions of nonexistent or misidentified objects. Prior work has partially
mitigated this via auxiliary training objectives or external modules, but
challenges remain in terms of scalability, adaptability, and model
independence. To address these limitations, we propose Adaptive Token Ensemble
Decoding (ATED), a training-free, token-level ensemble framework that mitigates
hallucination by aggregating predictions from multiple LVLMs during inference.
ATED dynamically computes uncertainty-based weights for each model, reflecting
their reliability at each decoding step. It also integrates diverse decoding
paths to improve contextual grounding and semantic consistency. Experiments on
standard hallucination detection benchmarks demonstrate that ATED significantly
outperforms state-of-the-art methods, reducing hallucination without
compromising fluency or relevance. Our findings highlight the benefits of
adaptive ensembling and point to a promising direction for improving LVLM
robustness in high-stakes applications. The code is available at
https://github.com/jinlin2021/ATED.
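A toy sketch of one way to weight and combine per-model next-token distributions by uncertainty at a single decoding step; the inverse-entropy weighting is an illustrative assumption, not necessarily the paper's exact formula:

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def ensemble_next_token(per_model_probs):
    """per_model_probs: (num_models, vocab) next-token distributions from each LVLM.
    Less uncertain models (lower entropy) receive larger weights."""
    probs = np.asarray(per_model_probs)
    uncertainty = entropy(probs)                    # (num_models,)
    weights = 1.0 / (uncertainty + 1e-6)
    weights = weights / weights.sum()
    mixed = (weights[:, None] * probs).sum(axis=0)  # weighted average distribution
    return int(np.argmax(mixed)), mixed

rng = np.random.default_rng(0)
vocab = 1000
logits = rng.normal(size=(3, vocab))                # three LVLMs, toy logits
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
token_id, mixed = ensemble_next_token(probs)
```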
☆ OmniNWM: Omniscient Driving Navigation World Models
Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, Xin Jin
Autonomous driving world models are expected to work effectively across three
core dimensions: state, action, and reward. Existing models, however, are
typically restricted to limited state modalities, short video sequences,
imprecise action control, and a lack of reward awareness. In this paper, we
introduce OmniNWM, an omniscient panoramic navigation world model that
addresses all three dimensions within a unified framework. For state, OmniNWM
jointly generates panoramic videos of RGB, semantics, metric depth, and 3D
occupancy. A flexible forcing strategy enables high-quality long-horizon
auto-regressive generation. For action, we introduce a normalized panoramic
Plucker ray-map representation that encodes input trajectories into pixel-level
signals, enabling highly precise and generalizable control over panoramic video
generation. Regarding reward, we move beyond learning reward functions with
external image-based models: instead, we leverage the generated 3D occupancy to
directly define rule-based dense rewards for driving compliance and safety.
Extensive experiments demonstrate that OmniNWM achieves state-of-the-art
performance in video generation, control accuracy, and long-horizon stability,
while providing a reliable closed-loop evaluation framework through
occupancy-grounded rewards. Project page is available at
https://github.com/Arlo0o/OmniNWM.
comment: https://arlo0o.github.io/OmniNWM/
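The Plücker coordinates of a camera ray are its direction together with the moment (camera center crossed with the direction). Below is a minimal sketch of building a per-pixel 6-channel ray map for one pinhole view, assuming intrinsics K and a camera-to-world pose; the paper's panoramic normalization is not reproduced:

```python
import numpy as np

def plucker_ray_map(K, R_c2w, t_c2w, height, width):
    """Return an (H, W, 6) map of [direction, moment] Plücker coordinates per pixel."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ np.linalg.inv(K).T                    # back-project to camera rays
    dirs_world = dirs_cam @ R_c2w.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(t_c2w, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)  # (H, W, 6)

K = np.array([[500.0, 0, 320], [0, 500.0, 180], [0, 0, 1]])
ray_map = plucker_ray_map(K, np.eye(3), np.array([1.0, 0.0, 1.5]), height=360, width=640)
```

Because every pixel carries its own ray, a trajectory can be turned into a dense, pixel-aligned control signal rather than a single global pose token.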
☆ The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Imaging technologies are fundamental to biomedical research and modern
medicine, requiring analysis of high-resolution images across various
modalities. While multimodal large language models (MLLMs) show promise for
biomedical image analysis, most are designed for low-resolution images from
general-purpose datasets, risking critical information loss. We investigate how
image resolution affects MLLM performance in biomedical applications and
demonstrate that: (1) native-resolution training and inference significantly
improve performance across multiple tasks, (2) misalignment between training
and inference resolutions severely degrades performance, and (3)
mixed-resolution training effectively mitigates misalignment and balances
computational constraints with performance requirements. Based on these
findings, we recommend prioritizing native-resolution inference and
mixed-resolution datasets to optimize biomedical MLLMs for transformative
impact in scientific research and clinical applications.
comment: Proceedings of the 10th Machine Learning for Healthcare Conference,
PMLR 298, 2025
☆ Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
Incentivizing the reasoning ability of Multimodal Large Language Models
(MLLMs) is essential for medical applications to transparently analyze medical
scans and provide reliable diagnosis. However, existing medical MLLMs rely
solely on internal knowledge during reasoning, leading to hallucinated
reasoning and factual inaccuracies when encountering cases beyond their
training scope. Although recent Agentic Retrieval-Augmented Generation (RAG)
methods elicit the medical model's proactive retrieval ability during
reasoning, they are confined to unimodal LLMs, neglecting the crucial visual
information during reasoning and retrieval. Consequently, we propose the first
Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively
retrieves external knowledge by querying observed symptoms or domain-specific
medical concepts during reasoning. Specifically, we design a two-stage
reinforcement learning strategy with tailored rewards that stimulate the model
to leverage both visual diagnostic findings and textual clinical information
for effective retrieval. Building on this foundation, we further propose a
Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when
low prediction confidence is detected. Evaluation on various public medical
benchmarks demonstrates Med-RwR's significant improvements over baseline
models, proving the effectiveness of enhancing reasoning capabilities with
external knowledge integration. Furthermore, Med-RwR demonstrates remarkable
generalizability to unfamiliar domains, evidenced by 8.8% performance gain on
our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of
echocardiography data in the training corpus. Our data, model, and codes will
be made publicly available at https://github.com/xmed-lab/Med-RwR.
comment: Work in progress
☆ GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation ICCV
We introduce a novel framework for metric depth estimation that enhances
pretrained diffusion-based monocular depth estimation (DB-MDE) models with
stereo vision guidance. While existing DB-MDE methods excel at predicting
relative depth, estimating absolute metric depth remains challenging due to
scale ambiguities in single-image scenarios. To address this, we reframe depth
estimation as an inverse problem, leveraging pretrained latent diffusion models
(LDMs) conditioned on RGB images, combined with stereo-based geometric
constraints, to learn scale and shift for accurate depth recovery. Our
training-free solution seamlessly integrates into existing DB-MDE frameworks
and generalizes across indoor, outdoor, and complex environments. Extensive
experiments demonstrate that our approach matches or surpasses state-of-the-art
methods, particularly in challenging scenarios involving translucent and
specular surfaces, all without requiring retraining.
comment: Accepted to ICCV Findings 2025. The first two authors contributed
equally. The last two authors share co-corresponding authorship
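Recovering metric depth from an affine-invariant prediction reduces to estimating a scale and a shift. A least-squares sketch, assuming sparse metric depths (e.g., from stereo) are available at a few pixels; this is only the generic alignment step, not the diffusion-based pipeline itself:

```python
import numpy as np

def fit_scale_shift(relative_depth, metric_depth, valid_mask):
    """Solve min_{s, b} || s * d_rel + b - d_metric ||^2 over valid pixels."""
    d = relative_depth[valid_mask].ravel()
    m = metric_depth[valid_mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, m, rcond=None)
    return scale, shift

rng = np.random.default_rng(0)
rel = rng.uniform(0, 1, size=(240, 320))               # relative depth from a DB-MDE model
metric_gt = 4.0 * rel + 1.5 + 0.05 * rng.normal(size=rel.shape)
sparse = rng.random(rel.shape) < 0.02                  # metric depth known at ~2% of pixels
scale, shift = fit_scale_shift(rel, metric_gt, sparse)
metric_pred = scale * rel + shift
```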
☆ Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models
Identity preserving editing of faces is a generative task that enables
modifying the illumination, adding/removing eyeglasses, face aging, editing
hairstyles, modifying expression etc., while preserving the identity of the
face. Recent progress in 2D generative models has enabled photorealistic
editing of faces using simple techniques leveraging the compositionality in
GANs. However, identity preserving editing for 3D faces with a given set of
attributes is a challenging task as the generative model must reason about view
consistency from multiple poses and render a realistic 3D face. Further, 3D
portrait editing requires large-scale attribute labelled datasets and presents
a trade-off between editability in low-resolution and inflexibility to editing
in high resolution. In this work, we aim to alleviate some of the constraints
in editing 3D faces by identifying latent space directions that correspond to
photorealistic edits. To address this, we present a method that builds on
recent advancements in 3D-aware deep generative models and 2D portrait editing
techniques to perform efficient few-shot identity preserving attribute editing
for 3D-aware generative models. We aim to show from experimental results that
using just ten or fewer labelled images of an attribute is sufficient to
estimate edit directions in the latent space that correspond to 3D-aware
attribute editing. In this work, we leverage an existing face dataset with
masks to obtain the synthetic images for few attribute examples required for
estimating the edit directions. Further, to demonstrate the linearity of edits,
we investigate one-shot stylization by performing sequential editing and use
the (2D) Attribute Style Manipulation (ASM) technique to investigate a
continuous style manifold for 3D consistent identity preserving face aging.
Code and results are available at: https://vishal-vinod.github.io/gmpi-edit/
comment: 14 pages, 7 figures
☆ StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Unlike offline processing, streaming video vision-language models face two
fundamental constraints: causality and accumulation. Causality prevents access
to future frames that offline methods exploit, while accumulation causes tokens
to grow unbounded, creating efficiency bottlenecks. However, existing
approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill
unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage
framework that addresses both pre-LLM and post-LLM bottlenecks with predictable
latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects
tokens based on adjacent-frame changes and token saliency, drastically reducing
per-frame prefill cost by processing only a compact subset of visual tokens per
frame instead of all visual tokens. Online Quantized Memory stores tokens in
4-bit format, retrieves relevant groups on demand, and dequantizes them,
keeping the active kv-cache bounded regardless of stream length. Experiments
demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$
lower peak memory and $2\times$ faster TTFT compared to prior SOTA.
StreamingTOM maintains state-of-the-art accuracy among training-free methods
with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS.
These results highlight the practical benefits of our two-stage approach for
efficient streaming video understanding with bounded growth.
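A toy sketch of the two stages: keep a fixed per-frame budget of visual tokens scored by saliency plus change from the previous frame, then store the kept tokens in 4-bit form; the scoring rule and quantization details here are illustrative assumptions:

```python
import numpy as np

def select_frame_tokens(tokens, prev_tokens, budget=64):
    """tokens: (N, C) visual tokens of the current frame. Keep `budget` tokens
    scored by token saliency (L2 norm) plus change relative to the previous frame."""
    saliency = np.linalg.norm(tokens, axis=1)
    change = (np.linalg.norm(tokens - prev_tokens, axis=1)
              if prev_tokens is not None else saliency)
    keep = np.argsort(-(saliency + change))[:budget]
    return tokens[keep]

def quantize_4bit(x):
    """Per-tensor uniform quantization to 16 levels (stored in uint8 for simplicity)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 15.0 if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, lo, scale

def dequantize_4bit(q, lo, scale):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
prev, memory = None, []                       # bounded, quantized token memory
for _ in range(8):                            # 8 streaming frames
    frame_tokens = rng.normal(size=(576, 128)).astype(np.float32)
    kept = select_frame_tokens(frame_tokens, prev, budget=64)
    memory.append(quantize_4bit(kept))
    prev = frame_tokens
```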
☆ TreeFedDG: Alleviating Global Drift in Federated Domain Generalization for Medical Image Segmentation
In medical image segmentation tasks, Domain Generalization (DG) under the
Federated Learning (FL) framework is crucial for addressing challenges related
to privacy protection and data heterogeneity. However, traditional federated
learning methods fail to account for the imbalance in information aggregation
across clients in cross-domain scenarios, leading to the Global Drift (GD)
problem and a consequent decline in model generalization performance. This
motivates us to delve deeper and define a new critical issue: global drift in
federated domain generalization for medical imaging (FedDG-GD). In this paper,
we propose a novel tree topology framework called TreeFedDG. First, starting
from the distributed characteristics of medical images, we design a
hierarchical parameter aggregation method based on a tree-structured topology
to suppress deviations in the global model direction. Second, we introduce a
parameter difference-based style mixing method (FedStyle), which enforces
mixing among clients with maximum parameter differences to enhance robustness
against drift. Third, we develop a progressive personalized fusion strategy
during model distribution, ensuring a balance between knowledge transfer and
personalized features. Finally, during the inference phase, we use feature
similarity to guide the retrieval of the most relevant model chain from the
tree structure for ensemble decision-making, thereby fully leveraging the
advantages of hierarchical knowledge. We conducted extensive experiments on two
publicly available datasets. The results demonstrate that our method
outperforms other state-of-the-art domain generalization approaches in these
challenging tasks and achieves better balance in cross-domain performance.
☆ Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization ICME2025
Existing 3D human mesh recovery methods often fail to fully exploit the
latent information (e.g., human motion, shape alignment), leading to issues
with limb misalignment and insufficient local details in the reconstructed
human mesh (especially in complex scenes). Furthermore, the performance
improvement gained by modelling mesh vertices and pose node interactions using
attention mechanisms comes at a high computational cost. To address these
issues, we propose a two-stage network for human mesh recovery based on latent
information and low dimensional learning. Specifically, the first stage of the
network fully excavates global (e.g., the overall shape alignment) and local
(e.g., textures, detail) information from the low and high-frequency components
of image features and aggregates this information into a hybrid latent
frequency domain feature. This strategy effectively extracts latent
information. Subsequently, utilizing extracted hybrid latent frequency domain
features collaborates to enhance 2D poses to 3D learning. In the second stage,
with the assistance of hybrid latent features, we model the interaction
learning between the rough 3D human mesh template and the 3D pose, optimizing
the pose and shape of the human mesh. Unlike existing mesh pose interaction
methods, we design a low-dimensional mesh pose interaction method through
dimensionality reduction and parallel optimization that significantly reduces
computational costs without sacrificing reconstruction accuracy. Extensive
experimental results on large publicly available datasets indicate superiority
compared to most state-of-the-art methods.
comment: Accepted by ICME2025
☆ From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
Subject-driven image generation models face a fundamental trade-off between
identity preservation (fidelity) and prompt adherence (editability). While
online reinforcement learning (RL), specifically GRPO, offers a promising
solution, we find that a naive application of GRPO leads to competitive
degradation, as the simple linear aggregation of rewards with static weights
causes conflicting gradient signals and a misalignment with the temporal
dynamics of the diffusion process. To overcome these limitations, we propose
Customized-GRPO, a novel framework featuring two key innovations: (i)
Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly
penalizes conflicted reward signals and amplifies synergistic ones, providing a
sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW),
which aligns the optimization pressure with the model's temporal dynamics by
prioritizing prompt-following in the early denoising steps and identity preservation in the later ones.
Extensive experiments demonstrate that our method significantly outperforms
naive GRPO baselines, successfully mitigating competitive degradation. Our
model achieves a superior balance, generating images that both preserve key
identity features and accurately adhere to complex textual prompts.
☆ UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
Large vision-language models (VLMs) have achieved remarkable success in
natural scene understanding, yet their application to underwater environments
remains largely unexplored. Underwater imagery presents unique challenges
including severe light attenuation, color distortion, and suspended particle
scattering, while requiring specialized knowledge of marine ecosystems and
organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive
benchmark specifically designed for underwater vision-language understanding.
UWBench comprises 15,003 high-resolution underwater images captured across
diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea
habitats. Each image is enriched with human-verified annotations including
15,281 object referring expressions that precisely describe marine organisms
and underwater structures, and 124,983 question-answer pairs covering diverse
reasoning capabilities from object recognition to ecological relationship
understanding. The dataset captures rich variations in visibility, lighting
conditions, and water turbidity, providing a realistic testbed for model
evaluation. Based on UWBench, we establish three comprehensive benchmarks:
detailed image captioning for generating ecologically informed scene
descriptions, visual grounding for precise localization of marine organisms,
and visual question answering for multimodal reasoning about underwater
environments. Extensive experiments on state-of-the-art VLMs demonstrate that
underwater understanding remains challenging, with substantial room for
improvement. Our benchmark provides essential resources for advancing
vision-language research in underwater contexts and supporting applications in
marine science, ecological monitoring, and autonomous underwater exploration.
Our code and benchmark will be available.
comment: We have released V1, which only reports the test results. Our work is
still ongoing, and the next version will be coming soon
☆ Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery ICME2025
3D human meshes show a natural hierarchical structure (like
torso-limbs-fingers). However, existing video-based 3D human mesh recovery
methods usually learn mesh features in Euclidean space, which makes it hard to
capture this hierarchical structure accurately and leads to erroneous
reconstructed human meshes. To
solve this problem, we propose a hyperbolic space learning method leveraging
temporal motion prior for recovering 3D human meshes from videos. First, we
design a temporal motion prior extraction module. This module extracts the
temporal motion features from the input 3D pose sequences and image feature
sequences respectively. Then it combines them into the temporal motion prior.
In this way, it can strengthen the ability to express features in the temporal
motion dimension. Since data representation in non-Euclidean space has been
proved to effectively capture hierarchical relationships in real-world datasets
(especially in hyperbolic space), we further design a hyperbolic space
optimization learning strategy. This strategy uses the temporal motion prior
information to assist learning, and uses 3D pose and pose motion information
respectively in the hyperbolic space to optimize and learn the mesh features.
Then, we combine the optimized results to get an accurate and smooth human
mesh. Besides, to make the optimization learning process of human meshes in
hyperbolic space stable and effective, we propose a hyperbolic mesh
optimization loss. Extensive experimental results on large publicly available
datasets indicate superiority in comparison with most state-of-the-art methods.
comment: Accepted by ICME2025
☆ OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion
Understanding 3D scenes is pivotal for autonomous driving, robotics, and
augmented reality. Recent semantic Gaussian Splatting approaches leverage
large-scale 2D vision models to project 2D semantic features onto 3D scenes.
However, they suffer from two major limitations: (1) insufficient contextual
cues for individual masks during preprocessing and (2) inconsistencies and
missing details when fusing multi-view features from these 2D models. In this
paper, we introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian
segmentation framework with Context-aware
Cross-view Fusion. Our method consists of two modules: Context-Aware Feature
Extraction, which augments each mask with rich semantic context, and
Attention-Driven Feature Aggregation, which selectively fuses multi-view
features to mitigate alignment errors and incompleteness. Through extensive
experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art
results in open-vocabulary 3D Gaussian segmentation, outperforming existing
baselines by a large margin. These findings underscore the robustness and
generality of our proposed approach, marking a significant step forward in 3D
scene understanding and its practical deployment across diverse real-world
scenarios.
☆ BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
Zero-shot 3D object classification is crucial for real-world applications
like autonomous driving, however it is often hindered by a significant domain
gap between the synthetic data used for training and the sparse, noisy LiDAR
scans encountered in the real-world. Current methods trained solely on
synthetic data fail to generalize to outdoor scenes, while those trained only
on real data lack the semantic diversity to recognize rare or unseen objects.
We introduce BlendCLIP, a multimodal pretraining framework that bridges this
synthetic-to-real gap by strategically combining the strengths of both domains.
We first propose a pipeline to generate a large-scale dataset of object-level
triplets -- consisting of a point cloud, image, and text description -- mined
directly from real-world driving data and human annotated 3D boxes. Our core
contribution is a curriculum-based data mixing strategy that first grounds the
model in the semantically rich synthetic CAD data before progressively adapting
it to the specific characteristics of real-world scans.
Our experiments show that our approach is highly label-efficient: introducing
as few as 1.5% real-world samples per batch into training boosts zero-shot
accuracy on the nuScenes benchmark by 27%. Consequently, our final model
achieves state-of-the-art performance on challenging outdoor datasets like
nuScenes and TruckScenes, improving over the best prior method by 19.3% on
nuScenes, while maintaining strong generalization on diverse synthetic
benchmarks. Our findings demonstrate that effective domain adaptation, not
full-scale real-world annotation, is the key to unlocking robust
open-vocabulary 3D perception. Our code and dataset will be released upon
acceptance on https://github.com/kesu1/BlendCLIP.
comment: Under Review
☆ DeepSeek-OCR: Contexts Optical Compression
We present DeepSeek-OCR as an initial investigation into the feasibility of
compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two
components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically,
DeepEncoder serves as the core engine, designed to maintain low activations
under high-resolution input while achieving high compression ratios to ensure
an optimal and manageable number of vision tokens. Experiments show that when
the number of text tokens is within 10 times that of vision tokens (i.e., a
compression ratio < 10x), the model can achieve decoding (OCR) precision of
97%. Even at a compression ratio of 20x, the OCR accuracy still remains at
about 60%. This shows considerable promise for research areas such as
historical long-context compression and memory forgetting mechanisms in LLMs.
Beyond this, DeepSeek-OCR also demonstrates high practical value. On
OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision
tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while
utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can
generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a
single A100-40G). Codes and model weights are publicly accessible at
http://github.com/deepseek-ai/DeepSeek-OCR.
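The compression ratio discussed above is simple token arithmetic; the short
sketch below (illustrative numbers, hypothetical function names) shows how the
text-token-to-vision-token ratio relates to the reported operating points of
roughly 97% OCR precision below 10x and about 60% near 20x.

```python
import math

def compression_ratio(num_text_tokens: int, num_vision_tokens: int) -> float:
    """Optical compression ratio as used in the abstract: text tokens per vision token."""
    return num_text_tokens / num_vision_tokens

def vision_tokens_for_ratio(num_text_tokens: int, target_ratio: float) -> int:
    """Vision tokens needed to stay at or below a target compression ratio."""
    return math.ceil(num_text_tokens / target_ratio)

print(compression_ratio(6000, 800))         # 7.5x, inside the ~97%-precision regime
print(vision_tokens_for_ratio(6000, 10.0))  # 600 vision tokens keep a 6000-token page at 10x
```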
☆ Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis
This paper presents a generation-based debiasing framework for object
detection. Prior debiasing methods are often limited by the representation
diversity of samples, while naive generative augmentation often preserves the
biases it aims to correct. Moreover, our analysis reveals that simply generating
more data for rare classes is suboptimal due to two core issues: i) instance
frequency is an incomplete proxy for the true data needs of a model, and ii)
current layout-to-image synthesis lacks the fidelity and control to generate
high-quality, complex scenes. To overcome this, we introduce the representation
score (RS) to diagnose representational gaps beyond mere frequency, guiding the
creation of new, unbiased layouts. To ensure high-quality synthesis, we replace
ambiguous text prompts with a precise visual blueprint and employ a generative
alignment strategy, which fosters communication between the detector and
generator. Our method significantly narrows the performance gap for
underrepresented object groups, e.g., improving large/rare instances by 4.4/3.6
mAP over the baseline, and surpassing prior L2I synthesis models by 15.9 mAP
for layout accuracy in generated images.
☆ DualHash: A Stochastic Primal-Dual Algorithm with Theoretical Guarantee for Deep Hashing
Deep hashing converts high-dimensional feature vectors into compact binary
codes, enabling efficient large-scale retrieval. A fundamental challenge in
deep hashing stems from the discrete nature of quantization in generating the
codes. W-type regularizations, such as $||z|-1|$, have been proven effective as
they encourage variables toward binary values. However, existing methods often
directly optimize these regularizations without convergence guarantees. While
proximal gradient methods offer a promising solution, the coupling between
W-type regularizers and neural network outputs results in composite forms that
generally lack closed-form proximal solutions. In this paper, we present a
stochastic primal-dual hashing algorithm, referred to as DualHash, that
provides rigorous complexity bounds. Using Fenchel duality, we partially
transform the nonconvex W-type regularization optimization into the dual space,
which results in a proximal operator that admits closed-form solutions. We
derive two algorithm instances: a momentum-accelerated version with
$\mathcal{O}(\varepsilon^{-4})$ complexity and an improved
$\mathcal{O}(\varepsilon^{-3})$ version using variance reduction. Experiments
on three image retrieval databases demonstrate the superior performance of
DualHash.
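As a quick illustration of why the W-type regularizer ||z|-1| mentioned above
pushes continuous codes toward binary values, the numeric check below (NumPy,
not the paper's algorithm) evaluates it at a few points: it is zero exactly at
plus or minus one and grows as activations drift away.

```python
import numpy as np

def w_regularizer(z: np.ndarray) -> np.ndarray:
    """W-type regularization ||z| - 1|: zero at z = +/-1, increasing linearly elsewhere,
    so minimizing it drives relaxed hash codes toward binary values."""
    return np.abs(np.abs(z) - 1.0)

z = np.array([-2.0, -1.0, -0.3, 0.0, 0.5, 1.0, 1.7])
print(w_regularizer(z))  # [1.  0.  0.7 1.  0.5 0.  0.7]
```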
☆ VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng
Safety evaluation of multimodal foundation models often treats vision and
language inputs separately, missing risks from joint interpretation where
benign content becomes harmful in combination. Existing approaches also fail to
distinguish clearly unsafe content from borderline cases, leading to
problematic over-blocking or under-refusal of genuinely harmful content. We
present Vision Language Safety Understanding (VLSU), a comprehensive framework
to systematically evaluate multimodal safety through fine-grained severity
classification and combinatorial analysis across 17 distinct safety patterns.
Using a multi-stage pipeline with real-world images and human annotation, we
construct a large-scale benchmark of 8,187 samples spanning 15 harm categories.
Our evaluation of eleven state-of-the-art models reveals systematic joint
understanding failures: while models achieve 90%-plus accuracy on clear
unimodal safety signals, performance degrades substantially to 20-55% when
joint image-text reasoning is required to determine the safety label. Most
critically, 34% of errors in joint image-text safety classification occur
despite correct classification of the individual modalities, further
demonstrating absent compositional reasoning capabilities. Additionally, we
find that models struggle to balance refusing unsafe content while still
responding to borderline cases that deserve engagement. For example, we find
that instruction framing can reduce the over-blocking rate on borderline
content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of
under-refusing on unsafe content with refusal rate dropping from 90.8% to
53.9%. Overall, our framework exposes weaknesses in joint image-text
understanding and alignment gaps in current models, and provides a critical
test bed to enable the next milestones in research on robust vision-language
safety.
comment: 10 pages, 5 figures, 4 tables. Under review
☆ EMA-SAM: Exponential Moving-average for SAM-based PTMC Segmentation
Papillary thyroid microcarcinoma (PTMC) is increasingly managed with
radio-frequency ablation (RFA), yet accurate lesion segmentation in ultrasound
videos remains difficult due to low contrast, probe-induced motion, and
heat-related artifacts. The recent Segment Anything Model 2 (SAM-2) generalizes
well to static images, but its frame-independent design yields unstable
predictions and temporal drift in interventional ultrasound. We introduce
\textbf{EMA-SAM}, a lightweight extension of SAM-2 that incorporates a
confidence-weighted exponential moving average pointer into the memory bank,
providing a stable latent prototype of the tumour across frames. This design
preserves temporal coherence through probe pressure and bubble occlusion while
rapidly adapting once clear evidence reappears. On our curated PTMC-RFA dataset
(124 minutes, 13 patients), EMA-SAM improves \emph{maxDice} from 0.82 (SAM-2)
to 0.86 and \emph{maxIoU} from 0.72 to 0.76, while reducing false positives by
29\%. On external benchmarks, including VTUS and colonoscopy video polyp
datasets, EMA-SAM achieves consistent gains of 2--5 Dice points over SAM-2.
Importantly, the EMA pointer adds <0.1% FLOPs, preserving real-time
throughput of ~30 FPS on a single A100 GPU. These results establish
EMA-SAM as a robust and efficient framework for stable tumour tracking,
bridging the gap between foundation models and the stringent demands of
interventional ultrasound. Codes are available at
https://github.com/mdialameh/EMA-SAM.
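A minimal sketch of the confidence-weighted exponential-moving-average pointer
described above is given below; the update rule, the base rate alpha, and the
class interface are assumptions for illustration and may differ from the
authors' implementation.

```python
import numpy as np

class EMAPointer:
    """Confidence-weighted exponential moving average over a latent tumour prototype.

    Low-confidence frames (probe pressure, bubble occlusion) barely move the pointer,
    while high-confidence frames update it quickly. A sketch of the idea in the
    abstract; the exact EMA-SAM update may differ."""

    def __init__(self, dim: int, alpha: float = 0.3):
        self.alpha = alpha          # base update rate (assumed hyperparameter)
        self.pointer = np.zeros(dim)
        self.initialized = False

    def update(self, frame_feature: np.ndarray, confidence: float) -> np.ndarray:
        if not self.initialized:
            self.pointer = frame_feature.copy()
            self.initialized = True
        else:
            w = self.alpha * float(np.clip(confidence, 0.0, 1.0))
            self.pointer = (1.0 - w) * self.pointer + w * frame_feature
        return self.pointer
```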
☆ FST.ai 2.0: An Explainable AI Ecosystem for Fair, Fast, and Inclusive Decision-Making in Olympic and Paralympic Taekwondo
Fair, transparent, and explainable decision-making remains a critical
challenge in Olympic and Paralympic combat sports. This paper presents
\emph{FST.ai 2.0}, an explainable AI ecosystem designed to support referees,
coaches, and athletes in real time during Taekwondo competitions and training.
The system integrates pose-based action recognition using graph convolutional
networks (GCNs), epistemic uncertainty modeling through credal sets, and
explainability overlays for visual decision support. A set of interactive
dashboards enables human--AI collaboration in referee evaluation, athlete
performance analysis, and Para-Taekwondo classification. Beyond automated
scoring, FST.ai~2.0 incorporates modules for referee training, fairness
monitoring, and policy-level analytics within the World Taekwondo ecosystem.
Experimental validation on competition data demonstrates an 85\% reduction in
decision review time and 93\% referee trust in AI-assisted decisions. The
framework thus establishes a transparent and extensible pipeline for
trustworthy, data-driven officiating and athlete assessment. By bridging
real-time perception, explainable inference, and governance-aware design,
FST.ai~2.0 represents a step toward equitable, accountable, and human-aligned
AI in sports.
comment: 23 pages, 12 figures
☆ A Generalizable Light Transport 3D Embedding for Global Illumination
Bing Xu, Mukund Varma T, Cheng Wang, Tzumao Li, Lifan Wu, Bartlomiej Wronski, Ravi Ramamoorthi, Marco Salvi
Global illumination (GI) is essential for realistic rendering but remains
computationally expensive due to the complexity of simulating indirect light
transport. Recent neural methods have mainly relied on per-scene optimization,
sometimes extended to handle changes in camera or geometry. Efforts toward
cross-scene generalization have largely stayed in 2D screen space, such as
neural denoising or G-buffer based GI prediction, which often suffer from view
inconsistency and limited spatial understanding. We propose a generalizable 3D
light transport embedding that approximates global illumination directly from
3D scene configurations, without using rasterized or path-traced cues. Each
scene is represented as a point cloud with geometric and material features. A
scalable transformer models global point-to-point interactions to encode these
features into neural primitives. At render time, each query point retrieves
nearby primitives via nearest-neighbor search and aggregates their latent
features through cross-attention to predict the desired rendering quantity. We
demonstrate results on diffuse global illumination prediction across diverse
indoor scenes with varying layouts, geometry, and materials. The embedding
trained for irradiance estimation can be quickly adapted to new rendering tasks
with limited fine-tuning. We also present preliminary results for
spatial-directional radiance field estimation for glossy materials and show how
the normalized field can accelerate unbiased path guiding. This approach
highlights a path toward integrating learned priors into rendering pipelines
without explicit ray-traced illumination cues.
☆ RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology
Chengrun Li, Corentin Royer, Haozhe Luo, Bastian Wittmann, Xia Li, Ibrahim Hamamci, Sezgin Er, Anjany Sekuboyina, Bjoern Menze
Most current medical vision language models struggle to jointly generate
diagnostic text and pixel-level segmentation masks in response to complex
visual questions. This represents a major limitation towards clinical
application, as assistive systems that fail to provide both modalities
simultaneously offer limited value to medical practitioners. To alleviate this
limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality
detection, diagnosis, and multi-target segmentation into a unified and
hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is
precisely designed to support the development of models that produce
descriptive text and corresponding segmentation masks in tandem. Subsequently,
we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M,
capable of joint abnormality detection, diagnosis, and flexible segmentation.
RadDiagSeg-M provides highly informative and clinically useful outputs,
effectively addressing the need to enrich contextual information for assistive
diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong
performance across all components involved in the task of multi-target
text-and-mask generation, establishing a robust and competitive baseline.
☆ VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis
Fatima AlGhamdi, Omar Alharbi, Abdullah Aldwyish, Raied Aljadaany, Muhammad Kamran J Khan, Huda Alamri
Detecting anomalies in crowded scenes is challenging due to severe
inter-person occlusions and highly dynamic, context-dependent motion patterns.
Existing approaches often struggle to adapt to varying crowd densities and lack
interpretable anomaly indicators. To address these limitations, we introduce
VelocityNet, a dual-pipeline framework that combines head detection and dense
optical flow to extract person-specific velocities. Hierarchical clustering
categorizes these velocities into semantic motion classes (halt, slow, normal,
and fast), and a percentile-based anomaly scoring system measures deviations
from learned normal patterns. Experiments demonstrate the effectiveness of our
framework in real-time detection of diverse anomalous motion patterns within
densely crowded environments.
comment: 8 pages, 3 figures
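The percentile-based scoring idea above can be sketched in a few lines of
NumPy: fit the empirical distribution of per-person speeds on normal footage,
then score a new speed by how far into either tail it falls. This is an
illustrative reduction of the method (it omits the head-detection, optical-flow,
and clustering stages), and the exact scoring rule is an assumption.

```python
import numpy as np

def fit_velocity_percentiles(normal_velocities):
    """Store the empirical distribution of per-person speeds seen in normal footage."""
    return np.sort(np.asarray(normal_velocities, dtype=np.float64))

def anomaly_score(velocity: float, sorted_normal: np.ndarray) -> float:
    """Percentile-based score: 0 for a typical speed, close to 1 for an extreme
    halt or sprint relative to the learned normal distribution."""
    rank = np.searchsorted(sorted_normal, velocity) / len(sorted_normal)
    return 2.0 * abs(rank - 0.5)   # distance from the median percentile
```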
♻ ☆ Glyph: Scaling Context Windows via Visual-Text Compression
Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
Large language models (LLMs) increasingly rely on long-context modeling for
tasks such as document understanding, code analysis, and multi-step reasoning.
However, scaling context windows to the million-token level brings prohibitive
computational and memory costs, limiting the practicality of long-context LLMs.
In this work, we take a different perspective, visual context scaling, to tackle
this challenge. Instead of extending token-based sequences, we propose Glyph, a
framework that renders long texts into images and processes them with
vision-language models (VLMs). This approach substantially compresses textual
input while preserving semantic information, and we further design an
LLM-driven genetic search to identify optimal visual rendering configurations
for balancing accuracy and compression. Through extensive experiments, we
demonstrate that our method achieves 3-4x token compression while maintaining
accuracy comparable to leading LLMs such as Qwen3-8B on various long-context
benchmarks. This compression also leads to around 4x faster prefilling and
decoding, and approximately 2x faster SFT training. Furthermore, under extreme
compression, a 128K-context VLM could scale to handle 1M-token-level text
tasks. In addition, the rendered text data benefits real-world multimodal
tasks, such as document understanding. Our code and model are released at
https://github.com/thu-coai/Glyph.
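To illustrate the core operation of rendering long text into an image for a
VLM, here is a minimal Pillow sketch. The font, canvas width, and line-wrapping
scheme are assumptions; Glyph itself searches over rendering configurations
rather than fixing them.

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_to_image(text: str, width: int = 1024, font_size: int = 14,
                         chars_per_line: int = 110) -> Image.Image:
    """Render plain text onto a white canvas so a VLM can consume it as vision tokens.
    A minimal sketch of visual-text compression, not Glyph's searched configuration."""
    lines = []
    for paragraph in text.splitlines():
        lines.extend(textwrap.wrap(paragraph, width=chars_per_line) or [""])
    line_height = font_size + 4
    height = line_height * len(lines) + 20
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in a TrueType font for realistic rendering
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black", font=font)
    return img
```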
♻ ☆ PICABench: How Far Are We from Physically Realistic Image Editing?
Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu
Image editing has achieved remarkable progress recently. Modern editing
models could already follow complex instructions to manipulate the original
content. However, beyond completing the editing instructions, the accompanying
physical effects are the key to the generation realism. For example, removing
an object should also remove its shadow, reflections, and interactions with
nearby objects. Unfortunately, existing models and benchmarks mainly focus on
instruction completion but overlook these physical effects. So, at this moment,
how far are we from physically realistic image editing? To answer this, we
introduce PICABench, which systematically evaluates physical realism across
eight sub-dimensions (spanning optics, mechanics, and state transitions) for
most of the common editing operations (add, remove, attribute change, etc.). We
further propose PICAEval, a reliable evaluation protocol that uses
VLM-as-a-judge with per-case, region-level human annotations and questions.
Beyond benchmarking, we also explore effective solutions by learning physics
from videos and construct a training dataset PICA-100K. After evaluating most
of the mainstream models, we observe that physical realism remains a
challenging problem with ample room left to explore. We hope that our benchmark and
proposed solutions can serve as a foundation for future work moving from naive
content editing toward physically consistent realism.
♻ ☆ CaMiT: A Time-Aware Car Model Dataset for Classification and Generation NeurIPS 2025
AI systems must adapt to evolving visual environments, especially in domains
where object appearances change over time. We introduce Car Models in Time
(CaMiT), a fine-grained dataset capturing the temporal evolution of car models,
a representative class of technological artifacts. CaMiT includes 787K labeled
samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023),
supporting both supervised and self-supervised learning. Static pretraining on
in-domain data achieves competitive performance with large-scale generalist
models while being more resource-efficient, yet accuracy declines when models
are tested across years. To address this, we propose a time-incremental
classification setting, a realistic continual learning scenario with emerging,
evolving, and disappearing classes. We evaluate two strategies:
time-incremental pretraining, which updates the backbone, and time-incremental
classifier learning, which updates only the final layer, both improving
temporal robustness. Finally, we explore time-aware image generation that
leverages temporal metadata during training, yielding more realistic outputs.
CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained
visual recognition and generation.
comment: To be published in NeurIPS 2025 Track on Datasets and Benchmarks
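The time-incremental classifier-learning strategy evaluated above (update only
the final layer on each new time period) can be sketched in PyTorch as follows.
The feature-dimension attribute, the assumption that the backbone returns
pooled features, and the optimizer settings are illustrative, not the authors'
recipe.

```python
import torch
import torch.nn as nn

def update_classifier_only(backbone: nn.Module, num_classes: int, loader, epochs: int = 1, lr: float = 1e-3):
    """Time-incremental classifier learning: keep the pretrained backbone frozen and
    refit only the final linear layer on the current time period's data."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)
    feat_dim = getattr(backbone, "feature_dim", 768)   # assumed attribute
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)               # assumed to return (B, feat_dim) features
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```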
♻ ☆ Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
With video exploding across social media, surveillance, and education,
compressing long footage into concise yet faithful surrogates is crucial.
Supervised methods learn frame/shot importance from dense labels and excel
in-domain, but are costly and brittle across datasets; unsupervised methods
avoid labels but often miss high-level semantics and narrative cues. Recent
zero-shot pipelines use LLMs for training-free summarization, yet remain
sensitive to handcrafted prompts and dataset-specific normalization. We propose
a rubric-guided, pseudo-labeled prompting framework. A small subset of human
annotations is converted into high-confidence pseudo labels and aggregated into
structured, dataset-adaptive scoring rubrics for interpretable scene
evaluation. At inference, boundary scenes (first/last) are scored from their
own descriptions, while intermediate scenes include brief summaries of adjacent
segments to assess progression and redundancy, enabling the LLM to balance
local salience with global coherence without parameter tuning. Across three
benchmarks, our method is consistently effective. On SumMe and TVSum it
achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21)
by +0.85 and +0.84 and approaching supervised performance. On the query-focused
QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable
across validation videos. These results show that rubric-guided pseudo
labeling, coupled with contextual prompting, stabilizes LLM-based scoring and
yields a general, interpretable zero-shot paradigm for both generic and
query-focused video summarization.
♻ ☆ Leveraging AV1 motion vectors for Fast and Dense Feature Matching
We repurpose AV1 motion vectors to produce dense sub-pixel correspondences
and short tracks filtered by cosine consistency. On short videos, this
compressed-domain front end runs comparably to sequential SIFT while using far
less CPU, and yields denser matches with competitive pairwise geometry. As a
small SfM demo on a 117-frame clip, MV matches register all images and
reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; BA time grows
with match density. These results show compressed-domain correspondences are a
practical, resource-efficient front end with clear paths to scaling in full
pipelines.
comment: Accepted ICIR 2025, camera-ready version
♻ ☆ Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi
Automated parsing of scanned documents into richly structured,
machine-readable formats remains a critical bottleneck in Document AI, as
traditional multi-stage pipelines suffer from error propagation and limited
adaptability to diverse layouts. We introduce layoutRL, an end-to-end
reinforcement learning framework that trains models to be explicitly
layout-aware by optimizing a composite reward of normalized edit distance,
paragraph count accuracy, and reading order preservation. Leveraging our newly
released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic
scanned document parsing data with expert-filtered real-world documents, we
instantiate layoutRL in a vision-language-model-based parser called
Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and
formula extraction, and reading order detection, Infinity-Parser achieves new
state-of-the-art performance in both accuracy and structural fidelity,
outpacing specialist pipelines and general-purpose vision-language models. We
will publicly release our code and dataset to accelerate progress in robust
document understanding.
comment: 16 pages, 12 figures
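The composite reward named above combines three measurable signals; the sketch
below shows one plausible way to compute them (difflib's ratio stands in for a
normalized edit distance, and the weights are assumptions, not the paper's
values).

```python
import difflib

def normalized_edit_similarity(pred: str, ref: str) -> float:
    """1 minus normalized edit distance, approximated here with difflib's ratio."""
    return difflib.SequenceMatcher(None, pred, ref).ratio()

def paragraph_count_accuracy(pred_paras: list, ref_paras: list) -> float:
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    return 1.0 - min(1.0, abs(len(pred_paras) - len(ref_paras)) / len(ref_paras))

def reading_order_score(pred_order: list, ref_order: list) -> float:
    """Fraction of adjacent reference block pairs that keep their relative order."""
    pos = {block: i for i, block in enumerate(pred_order)}
    pairs = [(a, b) for a, b in zip(ref_order, ref_order[1:]) if a in pos and b in pos]
    if not pairs:
        return 0.0
    return sum(pos[a] < pos[b] for a, b in pairs) / len(pairs)

def composite_reward(pred_text, ref_text, pred_paras, ref_paras, pred_order, ref_order,
                     weights=(0.6, 0.2, 0.2)):
    w1, w2, w3 = weights  # weights are assumptions, not the paper's values
    return (w1 * normalized_edit_similarity(pred_text, ref_text)
            + w2 * paragraph_count_accuracy(pred_paras, ref_paras)
            + w3 * reading_order_score(pred_order, ref_order))
```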
♻ ☆ DeepDetect: Learning All-in-One Dense Keypoints
Keypoint detection is the foundation of many computer vision tasks, including
image registration, structure-from-motion, 3D reconstruction, visual odometry,
and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and
learning-based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong
performance yet suffer from key limitations: sensitivity to photometric
changes, low keypoint density and repeatability, limited adaptability to
challenging scenes, and lack of semantic understanding, often failing to
prioritize visually important regions. We present DeepDetect, an intelligent,
all-in-one, dense keypoint detector that unifies the strengths of classical
detectors using deep learning. Firstly, we create ground-truth masks by fusing
outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from
corners and blobs to prominent edges and textures in the images. Afterwards, a
lightweight and efficient model, ESPNet, is trained using these masks as
labels, enabling DeepDetect to focus semantically on images while producing
highly dense keypoints that are adaptable to diverse and visually degraded
conditions. Evaluations on the Oxford Affine Covariant Regions dataset
demonstrate that DeepDetect surpasses other detectors in keypoint density,
repeatability, and the number of correct matches, achieving maximum values of
0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003
(correct matches).
comment: 6 pages, 6 figures, 2 tables, 7 equations
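A minimal NumPy sketch of the mask-fusion step described above: per-detector
response maps are normalized and merged into a single ground-truth mask. The
max-based fusion rule and the threshold are assumptions, since the abstract
does not specify how the 7 keypoint and 2 edge detectors are combined.

```python
import numpy as np

def fuse_detector_masks(response_maps, threshold: float = 0.5) -> np.ndarray:
    """Fuse binary or soft response maps from several keypoint and edge detectors
    into one ground-truth mask (an illustrative fusion rule, not DeepDetect's exact one)."""
    stacked = np.stack([m.astype(np.float32) for m in response_maps], axis=0)
    stacked /= np.maximum(stacked.max(axis=(1, 2), keepdims=True), 1e-8)  # per-detector normalization
    fused = stacked.max(axis=0)            # keep a location if any detector fires there
    return (fused >= threshold).astype(np.uint8)
```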
♻ ☆ Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing
Parkinson's disease (PD) severity diagnosis is crucial for early detecting
potential patients and adopting tailored interventions. Diagnosing PD based on
facial expression is grounded in PD patients' "masked face" symptom and has
recently gained growing interest for its convenience and affordability. However,
current facial expression-based approaches often rely on a single type of
expression, which can lead to misdiagnosis, and ignore the class imbalance
across different PD stages which degrades the prediction performance. Moreover,
most existing methods focus on binary classification (i.e., PD / non-PD) rather
than diagnosing the severity of PD. To address these issues, we propose a new
facial expression-based method for PD severity diagnosis which integrates
multiple facial expression features through attention-based feature fusion.
Moreover, we mitigate the class imbalance problem via an adaptive class
balancing strategy which dynamically adjusts the contribution of training
samples based on their class distribution and classification difficulty.
Experimental results demonstrate the promising performance of the proposed
method for PD severity diagnosis, as well as the efficacy of attention-based
feature fusion and adaptive class balancing.
comment: 3 pages, 2 figures, accepted by MIND 2025
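The adaptive class-balancing idea above, weighting training samples by class
distribution and classification difficulty, can be sketched as a simple
per-sample weight; the exact formulation below (inverse class frequency times a
focal-style difficulty term) is an assumption, not the paper's strategy.

```python
import numpy as np

def adaptive_sample_weights(labels, correct_probs, beta: float = 1.0) -> np.ndarray:
    """Weight each training sample by inverse class frequency and by how hard the
    classifier currently finds it (1 - predicted probability of the true class).

    labels: int array of class ids; correct_probs: predicted probability of the true class."""
    labels = np.asarray(labels)
    correct_probs = np.asarray(correct_probs, dtype=np.float64)
    counts = np.bincount(labels)
    inv_freq = 1.0 / counts[labels]                 # rarer PD-severity classes get larger weight
    difficulty = (1.0 - correct_probs) ** beta      # harder samples get larger weight
    w = inv_freq * difficulty
    return w / max(w.mean(), 1e-12)                 # keep the average weight near 1
```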
♻ ☆ LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
We introduce \textbf{LongInsightBench}, the first benchmark designed to
assess models' ability to understand long videos, with a focus on human
language, viewpoints, actions, and other contextual elements, while integrating
\textbf{visual, audio, and text} modalities. Our benchmark excels in three key
areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select
approximately 1,000 videos from the open-source dataset FineVideo based on
duration limit and the information density of both visual and audio modalities,
focusing on content like lectures, interviews, and vlogs, which contain rich
language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have
designed six challenging task scenarios, including both Intra-Event and
Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance
Pipelines:} We have developed a three-step, semi-automated data quality
assurance pipeline to ensure the difficulty and validity of the synthesized
questions and answer options. Based on LongInsightBench, we designed a series
of experiments. Experimental results show that Omni-modal models (OLMs) still
face challenge in tasks requiring precise temporal localization (T-Loc) and
long-range causal inference (CE-Caus). Extended experiments reveal the
information loss and processing bias in multi-modal fusion of OLMs. Our dataset
and code are available at
https://anonymous.4open.science/r/LongInsightBench-910F/.
comment: Submitted to ARR Rolling Review
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating
physically plausible trajectories, but often fail to generalize to long-tail
scenarios due to the lack of essential world knowledge to understand and reason
about surrounding environments. In contrast, Vision-Language-Action (VLA)
models leverage world knowledge to handle challenging cases, but their limited
3D reasoning capability can lead to physically infeasible actions. In this work
we introduce DiffVLA++, an enhanced autonomous driving framework that
explicitly bridges cognitive reasoning and E2E planning through metric-guided
alignment. First, we build a VLA module directly generating semantically
grounded driving trajectories. Second, we design an E2E module with a dense
trajectory vocabulary that ensures physical feasibility. Third, and most
critically, we introduce a metric-guided trajectory scorer that guides and
aligns the outputs of the VLA and E2E modules, thereby integrating their
complementary strengths. The experiment on the ICCV 2025 Autonomous Grand
Challenge leaderboard shows that DiffVLA++ achieves an EPDMS of 49.12.
♻ ☆ VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Hao Yan, Xingchen Liu, Hao Wang, Zhenbiao Cao, Handong Zheng, Liang Yin, Xinxing Su, Zihao Chen, Jihao Wu, Minghui Liao, Chao Weng, Wei Chen, Yuliang Liu, Xiang Bai
Recent strides in multimodal large language models (MLLMs) have significantly
advanced their performance in many reasoning tasks. However, Abstract Visual
Reasoning (AVR) remains a critical challenge, primarily due to limitations in
perceiving abstract graphics. To tackle this issue, we investigate the
bottlenecks in current MLLMs and synthesize training data to improve their
abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR,
featuring tasks meticulously constructed to assess models' reasoning capacities
across five core dimensions and two high-level reasoning categories. Second, we
introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for
generating riddles with fine-grained perceptual descriptions. PRS not only
generates valuable training data for abstract graphics but also provides
fine-grained perceptual description, crucially allowing for supervision over
intermediate reasoning stages and thereby improving both training efficacy and
model interpretability. Our extensive experimental results on VisuRiddles
empirically validate that fine-grained visual perception is the principal
bottleneck and our synthesis framework markedly enhances the performance of
contemporary MLLMs on these challenging tasks. Our code and dataset will be
released at https://github.com/yh-hust/VisuRiddles
comment: 13 pages, 4 figures
♻ ☆ Latent Diffusion Model without Variational Autoencoder
Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
Recent progress in diffusion-based visual generation has largely relied on
latent diffusion models with variational autoencoders (VAEs). While effective
for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited
training efficiency, slow inference, and poor transferability to broader vision
tasks. These issues stem from a key limitation of VAE latent spaces: the lack
of clear semantic separation and strong discriminative structure. Our analysis
confirms that these properties are crucial not only for perception and
understanding tasks, but also for the stable and efficient training of latent
diffusion models. Motivated by this insight, we introduce SVG, a novel latent
diffusion model without variational autoencoders, which leverages
self-supervised representations for visual generation. SVG constructs a feature
space with clear semantic discriminability by leveraging frozen DINO features,
while a lightweight residual branch captures fine-grained details for
high-fidelity reconstruction. Diffusion models are trained directly on this
semantically structured latent space to facilitate more efficient learning. As
a result, SVG enables accelerated diffusion training, supports few-step
sampling, and improves generative quality. Experimental results further show
that SVG preserves the semantic and discriminative capabilities of the
underlying self-supervised representations, providing a principled pathway
toward task-general, high-quality visual representations. Code and
interpretations are available at https://howlin-wang.github.io/svg/.
♻ ☆ Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan
Instruction-based image editing has achieved remarkable progress; however,
models solely trained via supervised fine-tuning often overfit to annotated
patterns, hindering their ability to explore and generalize beyond training
distributions. To this end, we introduce Edit-R1, a novel post-training
framework for instruction-based image editing based on policy optimization.
Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a
likelihood-free policy optimization method consistent with the flow matching
forward process, thereby enabling the use of higher-order samplers and more
efficient training. Another key challenge here is the absence of a universal
reward model, resulting from the diverse nature of editing instructions and
tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM)
as a unified, training-free reward model, leveraging its output logits to
provide fine-grained feedback. Furthermore, we carefully design a low-variance
group filtering mechanism to reduce MLLM scoring noise and stabilize
optimization. UniWorld-V2, trained with this framework, achieves
\textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks,
scoring 4.49 and 7.83, respectively. Crucially, our framework is
model-agnostic, delivering substantial performance gains when applied to
diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its
wide applicability. Code and models are publicly available at
https://github.com/PKU-YuanGroup/UniWorld-V2.
♻ ☆ SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Existing research on 3D Large Language Models (LLMs) still struggles to
achieve grounded question-answering, primarily due to the under-exploration of
the mechanism of human-like scene-object grounded reasoning. This paper
bridges the gap by presenting a novel framework. We first introduce a grounded
Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a
complex reasoning task into simpler and manageable problems, and building
corresponding visual clues based on multimodal expert modules. To enable such a
method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning
dataset, consisting of 185K high-quality instances. Extensive experiments
across various complex 3D scene reasoning benchmarks demonstrate that our new
framework achieves strong performance with high grounding-QA coherence. To the
best of our knowledge, this is the first successful application of CoT
reasoning to 3D scene understanding, enabling step-by-step human-like reasoning
and showing potential for extension to broader 3D scene understanding
scenarios.
comment: Project page: https://scenecot.github.io/
♻ ☆ PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Large Multimodal Models (LMMs) are increasingly applied to scientific
research, yet it remains unclear whether they can reliably understand and
reason over the multimodal complexity of papers. A central challenge lies in
detecting and resolving inconsistencies across text, figures, tables, and
equations, issues that are often subtle, domain-specific, and ultimately
undermine clarity, reproducibility, and trust. Existing benchmarks overlook
this issue, either isolating single modalities or relying on synthetic errors
that fail to capture real-world complexity. We introduce PRISMM-Bench
(Peer-Review-sourced Inconsistency Set for Multimodal Models), the first
benchmark grounded in real reviewer-flagged inconsistencies in scientific
papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering
and human verification, we curate 262 inconsistencies from 242 papers. Based on
this set, we design three tasks, namely inconsistency identification, remedy
and pair matching, which assess a model's capacity to detect, correct, and
reason over inconsistencies across different modalities. Furthermore, to
address the notorious problem of choice-only shortcuts in multiple-choice
evaluation, where models exploit answer patterns without truly understanding
the question, we further introduce structured JSON-based answer representations
that minimize linguistic biases by reducing reliance on superficial stylistic
cues. We benchmark 21 leading LMMs, including large open-weight models
(GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5
with high reasoning). Results reveal strikingly low performance (26.1-54.2%),
underscoring the challenge of multimodal scientific reasoning and motivating
progress towards trustworthy scientific assistants.
♻ ☆ Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu
Fully open multimodal large language models (MLLMs) currently lag behind
proprietary counterparts, primarily due to a significant gap in data quality
for supervised fine-tuning (SFT). Existing open-source datasets are often
plagued by widespread noise and a critical deficit in complex reasoning data,
such as Chain-of-Thought (CoT), which hinders the development of advanced model
capabilities. Addressing these challenges, our work makes three primary
contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising
approximately 15 million QA pairs, processed through multiple cleaning
techniques and enhanced with a novel dual-level (short and long) CoT enrichment
strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its
underlying framework DataStudio, providing the community with a transparent and
adaptable methodology for data curation that moves beyond static dataset
releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B
model on Honey-Data-15M. Experiments show that Bee-8B establishes a new
state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is
competitive with, and in some cases surpasses, recent semi-open models such as
InternVL3.5-8B. Our work delivers to the community a suite of foundational
resources, including: the Honey-Data-15M corpus; the full-stack suite
comprising HoneyPipe and DataStudio; training recipes; an evaluation harness;
and the model weights. This effort demonstrates that a principled focus on data
quality is a key pathway to developing fully open MLLMs that are highly
competitive with their semi-open counterparts.
comment: homepage: https://open-bee.github.io/
♻ ☆ Fourier Transform Multiple Instance Learning for Whole Slide Image Classification
Anthony Bilic, Guangyu Sun, Ming Li, Md Sanzid Bin Hossain, Yu Tian, Wei Zhang, Laura Brattain, Dexter Hadley, Chen Chen
Whole Slide Image (WSI) classification relies on Multiple Instance Learning
(MIL) with spatial patch features, yet existing methods struggle to capture
global dependencies due to the immense size of WSIs and the local nature of
patch embeddings. This limitation hinders the modeling of coarse structures
essential for robust diagnostic prediction. We propose Fourier Transform
Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a
frequency-domain branch to provide compact global context. Low-frequency crops
are extracted from WSIs via the Fast Fourier Transform and processed through a
modular FFT-Block composed of convolutional layers and Min-Max normalization to
mitigate the high variance of frequency data. The learned global frequency
feature is fused with spatial patch features through lightweight integration
strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was
evaluated across six state-of-the-art MIL methods on three public datasets
(BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores
by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across
architectures and datasets. These results establish frequency-domain learning
as an effective and efficient mechanism for capturing global dependencies in
WSI classification, complementing spatial features and advancing the
scalability and accuracy of MIL-based computational pathology.
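A minimal NumPy sketch of the low-frequency crop extraction and Min-Max
normalization feeding the FFT-Block described above; it operates on a single
grayscale image, and the crop size and the log-compression step are assumptions
for illustration.

```python
import numpy as np

def low_frequency_crop(image: np.ndarray, crop: int = 64) -> np.ndarray:
    """Extract a centered low-frequency crop of the 2D spectrum and Min-Max normalize it,
    mirroring the kind of input the FFT-Block consumes (details are assumptions)."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))          # move low frequencies to the center
    h, w = spectrum.shape
    cy, cx = h // 2, w // 2
    patch = np.abs(spectrum[cy - crop // 2: cy + crop // 2,
                            cx - crop // 2: cx + crop // 2])
    patch = np.log1p(patch)                                  # compress dynamic range (assumed step)
    return (patch - patch.min()) / (patch.max() - patch.min() + 1e-8)  # Min-Max normalization
```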
♻ ☆ UniVideo: Unified Understanding, Generation, and Editing for Videos
Unified multimodal models have shown promising results in multimodal content
generation and editing but remain largely limited to the image domain. In this
work, we present UniVideo, a versatile framework that extends unified modeling
to the video domain. UniVideo adopts a dual-stream design, combining a
Multimodal Large Language Model (MLLM) for instruction understanding with a
Multimodal DiT (MMDiT) for video generation. This design enables accurate
interpretation of complex multimodal instructions while preserving visual
consistency. Built on this architecture, UniVideo unifies diverse video
generation and editing tasks under a single multimodal instruction paradigm and
is jointly trained across them. Extensive experiments demonstrate that UniVideo
matches or surpasses state-of-the-art task-specific baselines in
text/image-to-video generation, in-context video generation and in-context
video editing. Notably, the unified design of UniVideo enables two forms of
generalization. First, UniVideo supports task composition, such as combining
editing with style transfer, by integrating multiple capabilities within a
single instruction. Second, even without explicit training on free-form video
editing, UniVideo transfers its editing capability from large-scale image
editing data to this setting, handling unseen instructions such as
green-screening characters or changing materials within a video. Beyond these
core capabilities, UniVideo also supports visual-prompt-based video generation,
where the MLLM interprets visual prompts and guides the MMDiT during synthesis.
To foster future research, we will release our model and code.
comment: Project Website https://congwei1230.github.io/UniVideo/
♻ ☆ H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging
Zhen Huang, Tao Tang, Ronghao Xu, Yangbo Wei, Wenkai Yang, Suhua Wang, Xiaoxin Sun, Han Li, Qingsong Yao
3D landmark detection is a critical task in medical image analysis, and
accurately detecting anatomical landmarks is essential for subsequent medical
imaging tasks. However, mainstream deep learning methods in this field struggle
to simultaneously capture fine-grained local features and model global spatial
relationships, while maintaining a balance between accuracy and computational
efficiency. Local feature extraction requires capturing fine-grained anatomical
details, while global modeling requires understanding the spatial relationships
within complex anatomical structures. The high-dimensional nature of 3D volume
further exacerbates these challenges, as landmarks are sparsely distributed,
leading to significant computational costs. Therefore, achieving efficient and
precise 3D landmark detection remains a pressing challenge in medical image
analysis.
In this work, we propose a \textbf{H}ybrid \textbf{3}D \textbf{DE}tection
\textbf{Net} (H3DE-Net), a novel framework that combines CNNs for local feature
extraction with a lightweight attention mechanism designed to efficiently
capture global dependencies in 3D volumetric data. This mechanism employs a
hierarchical routing strategy to reduce computational cost while maintaining
global context modeling. To our knowledge, H3DE-Net is the first 3D landmark
detection model that integrates such a lightweight attention mechanism with
CNNs. Additionally, integrating multi-scale feature fusion further enhances
detection accuracy and robustness. Experimental results on a public CT dataset
demonstrate that H3DE-Net achieves state-of-the-art(SOTA) performance,
significantly improving accuracy and robustness, particularly in scenarios with
missing landmarks or complex anatomical variations. We have already
open-sourced our project, including code, data, and model weights.
♻ ☆ Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning NeurIPS 2025
Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, Yueting Zhuang
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify
visual comprehension and generation. However, these two capabilities remain
largely independent, as if they are two separate functions encapsulated within
the same model. Consequently, visual comprehension does not enhance visual
generation, and the reasoning mechanisms of LLMs have not been fully integrated
to revolutionize image generation. In this paper, we propose to enable the
collaborative co-evolution of visual comprehension and generation, advancing
image generation into an iterative introspective process. We introduce a
two-stage training approach: supervised fine-tuning equips the MLLM with the
foundational ability to generate genuine CoT for visual generation, while
reinforcement learning activates its full potential via an
exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in
visual generation, advancing MLLMs from text-to-image tasks to unified image
generation. Extensive experiments demonstrate that our model not only excels in
text-to-image generation and image editing, but also functions as a superior
image semantic evaluator with enhanced visual comprehension capabilities.
Project Page: https://janus-pro-r1.github.io.
comment: Accepted by NeurIPS 2025
♻ ★ VideoVerse: How Far is Your T2V Generator from a World Model?
The recent rapid advancement of Text-to-Video (T2V) generation technologies,
which are critical to build ``world models'', makes the existing benchmarks
increasingly insufficient to evaluate state-of-the-art T2V models. First,
current evaluation dimensions, such as per-frame aesthetic quality and temporal
consistency, are no longer able to differentiate state-of-the-art T2V models.
Second, event-level temporal causality, which not only distinguishes video from
other modalities but also constitutes a crucial component of world models, is
severely underexplored in existing benchmarks. Third, existing benchmarks lack
a systematic assessment of world knowledge, an essential capability for
building world models. To address these issues, we introduce VideoVerse, a
comprehensive benchmark that focuses on evaluating whether a T2V model could
understand complex temporal causality and world knowledge in the real world. We
collect representative videos across diverse domains (e.g., natural landscapes,
sports, indoor scenes, science fiction, chemical and physical experiments) and
extract their event-level descriptions with inherent temporal causality, which
are then rewritten into text-to-video prompts by independent annotators. For
each prompt, we design a suite of binary evaluation questions from the
perspective of dynamic and static properties, with a total of ten carefully
defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully
curated prompts, involving 815 events and 793 binary evaluation questions.
Consequently, a human preference aligned QA-based evaluation pipeline is
developed by using modern vision-language models. Finally, we perform a
systematic evaluation of state-of-the-art open-source and closed-source T2V
models on VideoVerse, providing in-depth analysis on how far the current T2V
generators are from world models.
comment: 24 Pages, 8 Figures, 11 Tables
♻ ☆ Interpretable Decision-Making for End-to-End Autonomous Driving ICCV 2025
Trustworthy AI is mandatory for the broad deployment of autonomous vehicles.
Although end-to-end approaches derive control commands directly from raw data,
interpreting these decisions remains challenging, especially in complex urban
scenarios. This is mainly attributed to very deep neural networks with
non-linear decision boundaries, making it challenging to grasp the logic behind
AI-driven decisions. This paper presents a method to enhance interpretability
while optimizing control commands in autonomous driving. To address this, we
propose loss functions that promote the interpretability of our model by
generating sparse and localized feature maps. The feature activations allow us
to explain which image regions contribute to the predicted control command. We
conduct comprehensive ablation studies on the feature extraction step and
validate our method on the CARLA benchmarks. We also demonstrate that our
approach improves interpretability, which correlates with reducing infractions,
yielding a safer, high-performance driving model. Notably, our monocular,
non-ensemble model surpasses the top-performing approaches from the CARLA
Leaderboard by achieving lower infraction scores and the highest route
completion rate, all while ensuring interpretability.
comment: Accepted to the ICCV 2025 2nd Workshop on the Challenge Of
Out-of-Label Hazards in Autonomous Driving (2COOOL)
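One way to realize the sparse-and-localized feature-map objective described
above is sketched below in PyTorch: an L1 term encourages sparsity, and a
spatial-variance term penalizes activation mass spread far from its own
centroid. Both terms and their weights are assumptions illustrating the idea,
not the paper's exact loss functions.

```python
import torch

def sparsity_locality_loss(feature_maps: torch.Tensor, l1_weight: float = 1e-4,
                           var_weight: float = 1e-3) -> torch.Tensor:
    """Encourage sparse, spatially localized activations on (B, C, H, W) feature maps."""
    b, c, h, w = feature_maps.shape
    act = feature_maps.abs()
    l1 = act.mean()                                              # sparsity term
    ys = torch.linspace(0, 1, h, device=feature_maps.device).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w, device=feature_maps.device).view(1, 1, 1, w)
    mass = act.sum(dim=(2, 3), keepdim=True) + 1e-8
    cy = (act * ys).sum(dim=(2, 3), keepdim=True) / mass         # per-channel activation centroid
    cx = (act * xs).sum(dim=(2, 3), keepdim=True) / mass
    spread = (act * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum(dim=(2, 3)) / mass.squeeze(-1).squeeze(-1)
    return l1_weight * l1 + var_weight * spread.mean()           # locality term penalizes diffuse maps
```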
♻ ☆ Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Yongjie Bai, Zhouxia Wang, Yang Liu, Weixing Chen, Ziliang Chen, Mingtong Dai, Yongsen Zheng, Lingbo Liu, Guanbin Li, Liang Lin
Recent vision-language-action (VLA) models for multi-task robotic
manipulation commonly rely on static viewpoints and shared visual encoders,
which limit 3D perception and cause task interference, hindering robustness and
generalization. In this work, we propose Task-Aware View Planning (TAVP), a
framework designed to overcome these challenges by integrating active view
planning with task-specific representation learning. TAVP employs an efficient
exploration policy, accelerated by a novel pseudo-environment, to actively
acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE)
visual encoder to disentangle features across different tasks, boosting both
representation fidelity and task generalization. By learning to see the world
in a task-aware way, TAVP generates more complete and discriminative visual
representations, demonstrating significantly enhanced action prediction across
a wide array of manipulation challenges. Extensive experiments on RLBench tasks
show that our proposed TAVP model achieves superior performance over
state-of-the-art fixed-view approaches. Visual results and code are provided
at: https://hcplab-sysu.github.io/TAVP.
comment: 14 pages, 8 figures, project page: https://hcplab-sysu.github.io/TAVP
♻ ☆ SimCortex: Collision-free Simultaneous Cortical Surfaces Reconstruction
Accurate cortical surface reconstruction from magnetic resonance imaging
(MRI) data is crucial for reliable neuroanatomical analyses. Current methods
have to contend with complex cortical geometries, strict topological
requirements, and often produce surfaces with overlaps, self-intersections, and
topological defects. To overcome these shortcomings, we introduce SimCortex, a
deep learning framework that simultaneously reconstructs all brain surfaces
(left/right white-matter and pial) from T1-weighted(T1w) MRI volumes while
preserving topological properties. Our method first segments the T1w image into
a nine-class tissue label map. From these segmentations, we generate
subject-specific, collision-free initial surface meshes. These surfaces serve
as precise initializations for subsequent multiscale diffeomorphic
deformations. Employing stationary velocity fields (SVFs) integrated via
scaling-and-squaring, our approach ensures smooth, topology-preserving
transformations with significantly reduced surface collisions and
self-intersections. Evaluations on standard datasets demonstrate that SimCortex
dramatically reduces surface overlaps and self-intersections, surpassing
current methods while maintaining state-of-the-art geometric accuracy.
comment: Metadata update: added journal reference and DOI linking to the
published chapter (Springer)
♻ ☆ Increasing the Utility of Synthetic Images through Chamfer Guidance NeurIPS 2025
Nicola Dall'Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek, Adriana Romero-Soriano, Michal Drozdzal
Conditional image generative models hold considerable promise to produce
infinite amounts of synthetic training data. Yet, recent progress in generation
quality has come at the expense of generation diversity, limiting the utility
of these models as a source of synthetic training data. Although guidance-based
approaches have been introduced to improve the utility of generated data by
focusing on quality or diversity, the (implicit or explicit) utility functions
oftentimes disregard the potential distribution shift between synthetic and
real data. In this work, we introduce Chamfer Guidance: a training-free
guidance approach which leverages a handful of real exemplar images to
characterize the quality and diversity of synthetic data. We show that by
leveraging the proposed Chamfer Guidance, we can boost the diversity of the
generations w.r.t. a dataset of real images while maintaining or improving the
generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our
approach achieves state-of-the-art few-shot performance with as little as 2
exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms
of distributional coverage, which increase to 97.5% and 92.7%, respectively,
when using 32 real images. We showcase the benefits of the Chamfer Guidance
generation by training downstream image classifiers on synthetic data,
achieving accuracy boost of up to 15% for in-distribution over the baselines,
and up to 16% in out-of-distribution. Furthermore, our approach does not
require using the unconditional model, and thus obtains a 31% reduction in
FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
comment: Accepted to NeurIPS 2025
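The quantity at the heart of the approach above, a Chamfer-style match between
generated samples and a handful of real exemplars in feature space, can be
sketched as follows; the symmetric formulation and the use of raw Euclidean
feature distances are assumptions, and the guidance mechanism built on top of
this distance is not reproduced here.

```python
import numpy as np

def chamfer_distance(synth_feats: np.ndarray, real_feats: np.ndarray) -> float:
    """Symmetric Chamfer distance between two sets of feature vectors,
    e.g. features of generated images (N, D) vs. a few real exemplars (M, D)."""
    d = np.linalg.norm(synth_feats[:, None, :] - real_feats[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```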
♻ ☆ RODS: Robust Optimization Inspired Diffusion Sampling for Detecting and Reducing Hallucination in Generative Models
Diffusion models have achieved state-of-the-art performance in generative
modeling, yet their sampling procedures remain vulnerable to
hallucinations-often stemming from inaccuracies in score approximation. In this
work, we reinterpret diffusion sampling through the lens of optimization and
introduce RODS (Robust Optimization-inspired Diffusion Sampler), a novel method
that detects and corrects high-risk sampling steps using geometric cues from
the loss landscape. RODS enforces smoother sampling trajectories and adaptively
adjusts perturbations, reducing hallucinations without retraining and at
minimal additional inference cost. Experiments on AFHQv2, FFHQ, and 11k-hands
demonstrate that RODS maintains comparable image quality and preserves
generation diversity. More importantly, it improves both sampling fidelity and
robustness, detecting over 70% of hallucinated samples and correcting more than
25%, all while avoiding the introduction of new artifacts. We release our code
at https://github.com/Yiqi-Verna-Tian/RODS.
♻ ☆ Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation NeurIPS 2025
Common computer vision systems typically assume ideal pinhole cameras but
fail when facing real-world camera effects such as fisheye distortion and
rolling shutter, mainly due to the lack of learning from training data with
camera effects. Existing data generation approaches either suffer from high
costs and sim-to-real gaps or fail to accurately model camera effects. To address
this bottleneck, we propose 4D Gaussian Ray Tracing (4D-GRT), a novel two-stage
pipeline that combines 4D Gaussian Splatting with physically-based ray tracing
for camera effect simulation. Given multi-view videos, 4D-GRT first
reconstructs dynamic scenes, then applies ray tracing to generate videos with
controllable, physically accurate camera effects. 4D-GRT achieves the fastest
rendering speed while performing better or comparable rendering quality
compared to existing baselines. Additionally, we construct eight synthetic
dynamic scenes in indoor environments across four camera effects as a benchmark
to evaluate generated videos with camera effects.
comment: Paper accepted to NeurIPS 2025 Workshop SpaVLE. Project page:
https://shigon255.github.io/4DGRT-project-page/
♻ ☆ VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
Coordinating multiple embodied agents in dynamic environments remains a core
challenge in artificial intelligence, requiring both perception-driven
reasoning and scalable cooperation strategies. While recent works have
leveraged large language models (LLMs) for multi-agent planning, a few have
begun to explore vision-language models (VLMs) for visual reasoning. However,
these VLM-based approaches remain limited in their support for diverse
embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical
benchmark tailored for embodied multi-agent cooperation, featuring three
structured levels: agent activation, task planning, and trajectory perception.
VIKI-Bench includes diverse robot embodiments, multi-view visual observations,
and structured supervision signals to evaluate reasoning grounded in visual
inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a
two-stage framework that fine-tunes a pretrained vision-language model (VLM)
using Chain-of-Thought annotated demonstrations, followed by reinforcement
learning under multi-level reward signals. Our extensive experiments show that
VIKI-R significantly outperforms baseline methods across all task levels.
Furthermore, we show that reinforcement learning enables the emergence of
compositional cooperation patterns among heterogeneous agents. Together,
VIKI-Bench and VIKI-R offer a unified testbed and method for advancing
multi-agent, visual-driven cooperation in embodied AI systems.
comment: Project page: https://faceong.github.io/VIKI-R/
♻ ☆ Neural 3D Object Reconstruction with Small-Scale Unmanned Aerial Vehicles
Àlmos Veres-Vitàlyos, Genis Castillo Gomez-Raya, Filip Lemic, Daniel Johannes Bugelnig, Bernhard Rinner, Sergi Abadal, Xavier Costa-Pérez
Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for
navigating indoor and hard-to-reach areas, yet their significant constraints in
payload and autonomy have largely prevented their use for complex tasks like
high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we
introduce a novel system architecture that enables fully autonomous,
high-fidelity 3D scanning of static objects using UAVs weighing under 100
grams. Our core innovation lies in a dual-reconstruction pipeline that creates
a real-time feedback loop between data capture and flight control. A
near-real-time (near-RT) process uses Structure from Motion (SfM) to generate
an instantaneous pointcloud of the object. The system analyzes the model
quality on the fly and dynamically adapts the UAV's trajectory to intelligently
capture new images of poorly covered areas. This ensures comprehensive data
acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline
employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR)
approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB)
location data to achieve superior accuracy. We implemented and validated this
architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both
single- and multi-UAV configurations, conclusively show that dynamic trajectory
adaptation consistently improves reconstruction quality over static flight
paths. This work demonstrates a scalable and autonomous solution that unlocks
the potential of miniaturized UAVs for fine-grained 3D reconstruction in
constrained environments, a capability previously limited to much larger
platforms.
comment: 13 pages, 16 figures, 3 tables, 45 references
♻ ☆ Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning
Medical Vision Foundation Models (Med-VFMs) have superior capabilities of
interpreting medical images due to the knowledge learned from self-supervised
pre-training with extensive unannotated images. To improve their performance on downstream evaluations, especially segmentation, a few samples from target domains are typically selected at random for fine-tuning. However, few works have explored how to adapt Med-VFMs to achieve optimal performance on target domains efficiently. It is therefore highly desirable to design an efficient way of fine-tuning Med-VFMs that selects informative samples to maximize their adaptation performance on target domains. To achieve
this, we propose an Active Source-Free Domain Adaptation (ASFDA) method to
efficiently adapt Med-VFMs to target domains for volumetric medical image
segmentation. This ASFDA employs a novel Active Learning (AL) method to select
the most informative samples from target domains for fine-tuning Med-VFMs
without access to source pre-training samples, thus maximizing their performance under a minimal selection budget. In this AL method, we design an
Active Test Time Sample Query strategy to select samples from the target
domains via two query metrics, including Diversified Knowledge Divergence (DKD)
and Anatomical Segmentation Difficulty (ASD). DKD is designed to measure the
source-target knowledge gap and intra-domain diversity. It utilizes the
knowledge of pre-training to guide the querying of source-dissimilar and
semantic-diverse samples from the target domains. ASD is designed to evaluate
the difficulty in segmentation of anatomical structures by measuring predictive
entropy from foreground regions adaptively. Additionally, our ASFDA method
employs a Selective Semi-supervised Fine-tuning to improve the performance and
efficiency of fine-tuning by identifying samples with high reliability from
unqueried ones.
comment: 17 pages, 5 figures, 8 tables
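The abstract's ASD metric amounts to predictive entropy measured over adaptively estimated foreground regions, which can then rank unlabeled target volumes for annotation. The following is a minimal sketch of that idea; the function names, the background-channel assumption, and the top-k selection rule are illustrative assumptions, not the paper's implementation.

```python
# Sketch of an ASD-style query score: predictive entropy restricted to
# (estimated) foreground voxels, then top-k selection under a budget.
import torch
import torch.nn.functional as F

def asd_score(logits: torch.Tensor, fg_threshold: float = 0.5) -> torch.Tensor:
    """logits: (C, D, H, W) voxel-wise class logits for one target volume."""
    probs = F.softmax(logits, dim=0)                                  # (C, D, H, W)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=0)       # (D, H, W)
    foreground = 1.0 - probs[0]            # assumption: channel 0 is background
    mask = (foreground > fg_threshold).float()
    return (entropy * mask).sum() / mask.sum().clamp_min(1.0)

def select_queries(score_list: list[torch.Tensor], budget: int) -> list[int]:
    """Rank unlabeled volumes by score and return indices to annotate."""
    scores = torch.stack(score_list)
    return torch.topk(scores, k=budget).indices.tolist()
```

In the full method this entropy term would be combined with a divergence-based diversity term (DKD) before ranking; the sketch above only covers the difficulty side.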
♻ ☆ Deep Learning in Palmprint Recognition-A Comprehensive Survey
Palmprint recognition has emerged as a prominent biometric technology, widely
applied in diverse scenarios. Traditional handcrafted methods for palmprint
recognition often fall short in representation capability, as they heavily
depend on researchers' prior knowledge. Deep learning (DL) has been introduced
to address this limitation, leveraging its remarkable successes across various
domains. While existing surveys focus narrowly on specific tasks within palmprint recognition, often grounded in traditional methodologies, there remains a significant gap in comprehensive research exploring DL-based approaches
across all facets of palmprint recognition. This paper bridges that gap by
thoroughly reviewing recent advancements in DL-powered palmprint recognition.
The paper systematically examines progress across key tasks, including
region-of-interest segmentation, feature extraction, and
security/privacy-oriented challenges. Beyond highlighting these advancements,
the paper identifies current challenges and uncovers promising opportunities
for future research. By consolidating state-of-the-art progress, this review
serves as a valuable resource for researchers, enabling them to stay abreast of
cutting-edge technologies and drive innovation in palmprint recognition.
comment: Palmprint recognition, biometrics, deep learning, feature extraction,
recognition tasks
♻ ☆ UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning NeurIPS 2025
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their
remarkable success as general-purpose multi-modal assistants, with particular
focuses on holistic image- and video-language understanding. Conversely, less
attention has been given to scaling fine-grained pixel-level understanding
capabilities, where the models are expected to realize pixel-level alignment
between visual signals and language semantics. Some previous studies have
applied LMMs to related tasks such as region-level captioning and referring
expression segmentation. However, these models are limited to performing either
referring or segmentation tasks independently and fail to integrate these
fine-grained perception capabilities into visual reasoning. To bridge this gap,
we propose UniPixel, a large multi-modal model capable of flexibly
comprehending visual prompt inputs and generating mask-grounded responses. Our
model distinguishes itself by seamlessly integrating pixel-level perception
with general visual understanding capabilities. Specifically, UniPixel
processes visual prompts and generates relevant masks on demand, then performs subsequent reasoning conditioned on these intermediate pointers during
inference, thereby enabling fine-grained pixel-level reasoning. The
effectiveness of our approach has been verified on 10 benchmarks across a
diverse set of tasks, including pixel-level referring/segmentation and
object-centric understanding in images/videos. A novel PixelQA task that
jointly requires referring, segmentation, and question answering is also
designed to verify the flexibility of our method.
comment: NeurIPS 2025 Camera Ready. Project Page:
https://polyu-chenlab.github.io/unipixel/
♻ ☆ Regression is all you need for medical image translation
While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have
achieved impressive results in natural image synthesis, their core strengths -
creativity and realism - can be detrimental in medical applications, where
accuracy and fidelity are paramount. These models instead risk introducing
hallucinations and replication of unwanted acquisition noise. Here, we propose
YODA (You Only Denoise once - or Average), a 2.5D diffusion-based framework for
medical image translation (MIT). Consistent with DM theory, we find that
conventional diffusion sampling stochastically replicates noise. To mitigate
this, we draw and average multiple samples, akin to physical signal averaging.
As this effectively approximates the DM's expected value, we term this
Expectation-Approximation (ExpA) sampling. We additionally propose regression
sampling YODA, which retains the initial DM prediction and omits iterative
refinement to produce noise-free images in a single step. Across five diverse
multi-modal datasets - including multi-contrast brain MRI and pelvic MRI-CT -
we demonstrate that regression sampling is not only substantially more
efficient but also matches or exceeds image quality of full diffusion sampling
even with ExpA. Our results reveal that iterative refinement solely enhances
perceptual realism without benefiting information translation, which we confirm
in relevant downstream tasks. YODA outperforms eight state-of-the-art DMs and
GANs and challenges the presumed superiority of DMs and GANs over
computationally cheap regression models for high-quality MIT. Furthermore, we
show that YODA-translated images are interchangeable with, or even superior to,
physical acquisitions for several medical applications.
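Two of the abstract's mechanisms are simple enough to sketch: ExpA sampling draws several stochastic translations and averages them to approximate the model's expected value, while the regression variant keeps the initial prediction and skips iterative refinement. The sketch below assumes a generic stochastic translator interface (`translate_once`, `predict_x0`); these are placeholders, not the authors' code.

```python
# Hedged sketch of Expectation-Approximation (ExpA) sampling and of the
# one-step regression variant described in the abstract.
import torch

@torch.no_grad()
def expa_sample(translate_once, source_image: torch.Tensor, n_samples: int = 8) -> torch.Tensor:
    """source_image: (C, H, W) input contrast; average several stochastic
    translations to suppress replicated acquisition noise."""
    samples = [translate_once(source_image) for _ in range(n_samples)]
    return torch.stack(samples, dim=0).mean(dim=0)

@torch.no_grad()
def regression_sample(model, source_image: torch.Tensor, t_init: int) -> torch.Tensor:
    """One-step variant: keep the model's initial x0 prediction and omit
    iterative refinement (interface assumed for illustration)."""
    return model.predict_x0(source_image, t=t_init)
```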
♻ ☆ ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text
Virtual try-on, which aims to seamlessly fit garments onto person images, has
recently seen significant progress with diffusion-based models. However,
existing methods commonly resort to duplicated backbones or additional image
encoders to extract garment features, which increases computational overhead
and network complexity. In this paper, we propose ITVTON, an efficient
framework that leverages the Diffusion Transformer (DiT) as its single
generator to improve image fidelity. By concatenating garment and person images
along the width dimension and incorporating textual descriptions from both,
ITVTON effectively captures garment-person interactions while preserving
realism. To further reduce computational cost, we restrict training to the
attention parameters within a single Diffusion Transformer (Single-DiT) block.
Extensive experiments demonstrate that ITVTON surpasses baseline methods both
qualitatively and quantitatively, setting a new standard for virtual try-on.
Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness
in real-world scenarios.
comment: Accepted by PRCV 2025
♻ ☆ Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation NeurIPS 2025
Diffusion-based inverse algorithms have shown remarkable performance across
various inverse problems, yet their reliance on numerous denoising steps incurs
high computational costs. While recent developments of fast diffusion ODE
solvers offer effective acceleration for diffusion sampling without
observations, their application in inverse problems remains limited due to the
heterogeneous formulations of inverse algorithms and their prevalent use of
approximations and heuristics, which often introduce significant errors that
undermine the reliability of analytical solvers. In this work, we begin with an
analysis of ODE solvers for inverse problems that reveals a linear combination
structure of approximations for the inverse trajectory. Building on this
insight, we propose a canonical form that unifies a broad class of
diffusion-based inverse algorithms and facilitates the design of more
generalizable solvers. Inspired by the linear subspace search strategy, we
propose Learnable Linear Extrapolation (LLE), a lightweight approach that
universally enhances the performance of any diffusion-based inverse algorithm
conforming to our canonical form. LLE optimizes the combination coefficients to
refine current predictions using previous estimates, alleviating the
sensitivity of analytical solvers for inverse algorithms. Extensive experiments
demonstrate consistent improvements of the proposed LLE method across multiple
algorithms and tasks, indicating its potential for more efficient solutions and
boosted performance of diffusion-based inverse algorithms with limited steps.
Codes for reproducing our experiments are available at
https://github.com/weigerzan/LLE_inverse_problem.
comment: Accepted by NeurIPS 2025
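The core of LLE, as described above, is a small set of learnable coefficients that linearly combine the history of intermediate estimates to refine the current prediction. A minimal sketch follows, assuming a fixed history length and an identity-style initialization; the training loop and the exact canonical form are not reproduced here.

```python
# Minimal sketch of Learnable Linear Extrapolation (LLE): a learned linear
# combination of past estimates refines the current one.
import torch
import torch.nn as nn

class LLECombiner(nn.Module):
    def __init__(self, history_len: int):
        super().__init__()
        # Initialize so the output equals the most recent estimate
        # (identity behaviour before any tuning).
        init = torch.zeros(history_len)
        init[-1] = 1.0
        self.coeffs = nn.Parameter(init)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        """history: list of past estimates, most recent last."""
        stacked = torch.stack(history[-len(self.coeffs):], dim=0)   # (K, ...)
        w = self.coeffs[-stacked.shape[0]:]
        return torch.einsum("k,k...->...", w, stacked)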
♻ ☆ Pose-free 3D Gaussian splatting via shape-ray estimation ICIP 2025
While generalizable 3D Gaussian splatting enables efficient, high-quality
rendering of unseen scenes, it heavily depends on precise camera poses for
accurate geometry. In real-world scenarios, obtaining accurate poses is
challenging, leading to noisy pose estimates and geometric misalignments. To
address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting
framework that overcomes these ambiguities by joint shape and camera ray estimation. Instead of relying on explicit 3D transformations, SHARE builds a
pose-aware canonical volume representation that seamlessly integrates
multi-view information, reducing misalignment caused by inaccurate pose
estimates. Additionally, anchor-aligned Gaussian prediction enhances scene
reconstruction by refining local geometry around coarse anchors, allowing for
more precise Gaussian placement. Extensive experiments on diverse real-world
datasets show that our method achieves robust performance in pose-free
generalizable Gaussian splatting. Code is available at
https://github.com/youngju-na/SHARE
comment: ICIP 2025 (Best Student Paper Award) Code available at:
https://github.com/youngju-na/SHARE
♻ ☆ Mask Image Watermarking NeurIPS
We present MaskWM, a simple, efficient, and flexible framework for image
watermarking. MaskWM has two variants: (1) MaskWM-D, which supports global
watermark embedding, watermark localization, and local watermark extraction for
applications such as tamper detection; (2) MaskWM-ED, which focuses on local
watermark embedding and extraction, offering enhanced robustness in small
regions to support fine-grained image protection. MaskWM-D builds on the
classical encoder-distortion layer-decoder training paradigm. In MaskWM-D, we
introduce a simple masking mechanism during the decoding stage that enables
both global and local watermark extraction. During training, the decoder is
guided by various types of masks applied to watermarked images before
extraction, helping it learn to localize watermarks and extract them from the
corresponding local areas. MaskWM-ED extends this design by incorporating the
mask into the encoding stage as well, guiding the encoder to embed the
watermark in designated local regions, which improves robustness under regional
attacks. Extensive experiments show that MaskWM achieves state-of-the-art
performance in global and local watermark extraction, watermark localization,
and multi-watermark embedding. It outperforms all existing baselines, including
the recent leading model WAM for local watermarking, while preserving high
visual quality of the watermarked images. In addition, MaskWM is highly
efficient and adaptable. It requires only 20 hours of training on a single
A6000 GPU, achieving 15x higher computational efficiency than WAM. By simply
adjusting the distortion layer, MaskWM can be quickly fine-tuned to meet
varying robustness requirements.
comment: Neural Information Processing Systems (NeurIPS) 2025
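The masking mechanism described for MaskWM-D can be summarized in a few lines: the watermarked image is masked before extraction so the decoder learns to recover bits from the surviving local region. The module interfaces below (a global `encoder` and a bit-logit `decoder`) are assumptions used only to show the training signal.

```python
# Hedged sketch of mask-guided local watermark extraction training.
import torch
import torch.nn.functional as F

def local_extraction_loss(encoder, decoder, image, bits, region_mask):
    """image: (B, 3, H, W); bits: (B, L) in {0, 1}; region_mask: (B, 1, H, W)."""
    watermarked = encoder(image, bits)        # global watermark embedding (assumed)
    masked = watermarked * region_mask        # keep only the queried local region
    logits = decoder(masked)                  # (B, L) recovered bit logits
    return F.binary_cross_entropy_with_logits(logits, bits.float())
```

MaskWM-ED would additionally feed the mask to the encoder so the watermark is embedded only in the designated region; the sketch omits that variant.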
♻ ☆ scSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy NeurIPS 2025
Fluorescence microscopy, while being a key driver for progress in the life
sciences, is also subject to technical limitations. To overcome them,
computational multiplexing techniques have recently been proposed, which allow
multiple cellular structures to be captured in a single image and later be
unmixed. Existing image decomposition methods are trained on a set of
superimposed input images and the respective unmixed target images. It is
critical to note that the relative strength (mixing ratio) of the superimposed
images for a given input is a priori unknown. However, existing methods are
trained on a fixed intensity ratio of superimposed inputs, making them not
cognizant of the range of relative intensities that can occur in fluorescence
microscopy. In this work, we propose a novel method called scSplit that is
cognizant of the severity of the above-mentioned mixing ratio. Our idea is
based on InDI, a popular iterative method for image restoration, and an ideal
starting point to embrace the unknown mixing ratio in any given input. We
introduce (i) a suitably trained regressor network that predicts the
degradation level (mixing ratio) of a given input image and (ii) a
degradation-specific normalization module, enabling degradation-aware inference
across all mixing ratios. We show that this method solves two relevant tasks in
fluorescence microscopy, namely image splitting and bleedthrough removal, and
empirically demonstrate the applicability of scSplit on 5 public datasets. The
source code with pre-trained models is hosted at
https://github.com/juglab/scSplit/.
comment: manuscript accepted at NeurIPS 2025
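The severity-cognizant inference described above can be pictured as a regressor that predicts the unknown mixing ratio followed by a degradation-specific normalization before splitting. The sketch below is a loose illustration under simplifying assumptions (a scalar mixing ratio, division as the normalization, and assumed module interfaces); it is not the paper's architecture.

```python
# Loose sketch of severity-aware splitting: predict the mixing ratio, normalize
# the input to a canonical severity, then split.
import torch
import torch.nn as nn

class DegradationAwareSplitter(nn.Module):
    def __init__(self, regressor: nn.Module, splitter: nn.Module):
        super().__init__()
        self.regressor = regressor   # predicts mixing ratio alpha in (0, 1)
        self.splitter = splitter     # ratio-conditioned splitting network (assumed)

    def forward(self, mixed: torch.Tensor):
        alpha = torch.sigmoid(self.regressor(mixed)).view(-1, 1, 1, 1)
        # Degradation-specific normalization: rescale by the predicted ratio
        # so the splitter sees inputs at a canonical severity level.
        normalized = mixed / alpha.clamp_min(1e-3)
        channel_a = self.splitter(normalized, alpha)
        channel_b = normalized - channel_a
        return channel_a, channel_b
```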
♻ ☆ A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography
Yui Lo, Yuqian Chen, Dongnan Liu, Leo Zekelman, Jarrett Rushmore, Yogesh Rathi, Nikos Makris, Alexandra J. Golby, Fan Zhang, Weidong Cai, Lauren J. O'Donnell
Shape measures have emerged as promising descriptors of white matter
tractography, offering complementary insights into anatomical variability and
associations with cognitive and clinical phenotypes. However, conventional
methods for computing shape measures are computationally expensive and
time-consuming for large-scale datasets due to reliance on voxel-based
representations. We propose Tract2Shape, a novel multimodal deep learning
framework that leverages geometric (point cloud) and scalar (tabular) features
to predict ten white matter tractography shape measures. To enhance model
efficiency, we apply a dimensionality reduction algorithm so that the model predicts five primary shape components. The model is trained and evaluated on
two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset.
We evaluate the performance of Tract2Shape by training and testing it on the
HCP-YA dataset and comparing the results with state-of-the-art models. To
further assess its robustness and generalization ability, we also test
Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep
learning models across all ten shape measures, achieving the highest average
Pearson's r and the lowest nMSE on the HCP-YA dataset. The ablation study shows
that both multimodal input and PCA contribute to performance gains. On the
unseen testing PPMI dataset, Tract2Shape maintains a high Pearson's r and low
nMSE, demonstrating strong generalizability in cross-dataset evaluation.
Tract2Shape enables fast, accurate, and generalizable prediction of white
matter shape measures from tractography data, supporting scalable analysis
across datasets. This framework lays a promising foundation for future
large-scale white matter shape analysis.
comment: Paper accepted to Human Brain Mapping. 25 pages, 3 figures, 8 tables
♻ ☆ REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
In this paper we tackle a fundamental question: "Can we train latent
diffusion models together with the variational auto-encoder (VAE) tokenizer in
an end-to-end manner?" Traditional deep-learning wisdom dictates that
end-to-end training is often preferable when possible. However, for latent
diffusion transformers, it is observed that end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is ineffective, even causing a degradation in final performance. We show that while the diffusion loss is
ineffective, end-to-end training can be unlocked through the
representation-alignment (REPA) loss -- allowing both VAE and diffusion model
to be jointly tuned during the training process. Despite its simplicity, the
proposed training recipe (REPA-E) shows remarkable performance; speeding up
diffusion model training by over 17x and 45x over REPA and vanilla training
recipes, respectively. Interestingly, we observe that end-to-end tuning with
REPA-E also improves the VAE itself; leading to improved latent space structure
and downstream generation performance. In terms of final performance, our
approach sets a new state-of-the-art; achieving FID of 1.12 and 1.69 with and
without classifier-free guidance on ImageNet 256 x 256. Code is available at
https://end2end-diffusion.github.io.
♻ ☆ VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching NeurIPS 2025
Vision-Language-Action (VLA) models have demonstrated strong multi-modal
reasoning capabilities, enabling direct action generation from visual
perception and language instructions in an end-to-end manner. However, their
substantial computational cost poses a challenge for real-time robotic control,
where rapid decision-making is essential. This paper introduces VLA-Cache, a
training-free inference acceleration method that reduces computational overhead
by adaptively caching and reusing static visual tokens across frames.
Exploiting the temporal continuity in robotic manipulation, VLA-Cache
identifies minimally changed tokens between adjacent frames and reuses their
cached key-value representations, thereby circumventing redundant computations.
Additionally, to maintain action precision, VLA-Cache selectively re-computes
task-relevant tokens that are environmentally sensitive, ensuring the fidelity
of critical visual information. To further optimize efficiency, we introduce a
layer adaptive token reusing strategy that dynamically adjusts the reuse ratio
based on attention concentration across decoder layers, prioritizing critical
tokens for recomputation. Extensive experiments on two simulation platforms
(LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache
achieves up to 1.7x speedup in CUDA latency and a 15% increase in control
frequency, with negligible loss on task success rate. The code and videos can
be found at our project page: https://vla-cache.github.io.
comment: Accepted to NeurIPS 2025
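The token-selection step at the heart of VLA-Cache is easy to illustrate: compare patch tokens across adjacent frames and flag the least-changed ones as cache hits. The similarity metric, reuse ratio, and shapes below are assumptions; the layer-adaptive adjustment and task-relevance filtering from the abstract are not shown.

```python
# Hedged sketch of selecting minimally changed visual tokens whose cached
# key/value pairs can be reused on the next frame.
import torch
import torch.nn.functional as F

def select_reusable_tokens(prev_tokens: torch.Tensor,
                           curr_tokens: torch.Tensor,
                           reuse_ratio: float = 0.5) -> torch.Tensor:
    """prev_tokens, curr_tokens: (N, D) visual tokens from consecutive frames.
    Returns a boolean mask marking tokens to serve from the KV cache."""
    change = 1.0 - F.cosine_similarity(prev_tokens, curr_tokens, dim=-1)   # (N,)
    k = int(reuse_ratio * change.numel())
    reuse = torch.zeros_like(change, dtype=torch.bool)
    if k > 0:
        idx = torch.topk(change, k=k, largest=False).indices   # least-changed tokens
        reuse[idx] = True
    return reuse
```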
♻ ★ VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing
reasoning and generalization capabilities of large language models (LLMs)
through reinforcement learning. Nevertheless, the potential of
reasoning-induced computation has not been thoroughly explored in the context
of image quality assessment (IQA), a task depending critically on visual
reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced
no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to
rank, a learning algorithm tailored to the intrinsically relative nature of
visual quality. Specifically, for a pair of images, we employ group relative
policy optimization to generate multiple quality scores for each image. These
estimates are used to compute comparative probabilities of one image having
higher quality than the other under the Thurstone model. Rewards for each
quality estimate are defined using continuous fidelity measures rather than
discretized binary labels. Extensive experiments show that the proposed
VisualQuality-R1 consistently outperforms discriminative deep learning-based
NR-IQA models as well as a recent reasoning-induced quality regression method.
Moreover, VisualQuality-R1 is capable of generating contextually rich,
human-aligned quality descriptions, and supports multi-dataset training without
requiring perceptual scale realignment. These features make VisualQuality-R1
especially well-suited for reliably measuring progress in a wide range of image
processing tasks like super-resolution and image generation.
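The learning signal described here pairs a Thurstone-style comparative probability (computed from multiple sampled quality scores per image) with a continuous fidelity reward against the ground-truth preference. The sketch below illustrates those two pieces under assumed shapes; the GRPO policy update itself is omitted.

```python
# Hedged sketch of the pairwise reward: Thurstone comparative probability from
# sampled scores, scored with a continuous fidelity measure.
import torch

def thurstone_prob(scores_a: torch.Tensor, scores_b: torch.Tensor) -> torch.Tensor:
    """scores_a, scores_b: (K,) sampled quality scores for two images."""
    mu = scores_a.mean() - scores_b.mean()
    sigma = (scores_a.var(unbiased=False) + scores_b.var(unbiased=False)).clamp_min(1e-6).sqrt()
    normal = torch.distributions.Normal(0.0, 1.0)
    return normal.cdf(mu / sigma)   # P(image A judged better than image B)

def fidelity_reward(p_model: torch.Tensor, p_true: torch.Tensor) -> torch.Tensor:
    """Continuous fidelity between predicted and ground-truth preference,
    used instead of a discretized 0/1 label."""
    return (p_model * p_true).sqrt() + ((1 - p_model) * (1 - p_true)).sqrt()
```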
♻ ☆ MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan
Vision language models (VLMs) are increasingly deployed as controllers with
access to external tools for complex reasoning and decision-making, yet their
effectiveness remains limited by the scarcity of high-quality multimodal
trajectories and the cost of manual annotation. We address this challenge with
a vision-centric agent tuning framework that automatically synthesizes
multimodal trajectories, generates step-wise preference pairs, and trains a VLM
controller for robust tool-use reasoning. Our pipeline first constructs
M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified
trajectories, enabling imitation-based trajectory tuning. Building on this, we
develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool
reasoning. To achieve finer alignment, we further introduce Pref-X, a set of
11K automatically generated preference pairs, and optimize MATRIX on it via
step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA,
MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating
scalable and effective multimodal tool use. Our data and code are available at
https://github.com/mbzuai-oryx/MATRIX.
comment: We have come across a recent approach that was not properly attributed at the time of submission or compared in a fair setting. Therefore, we would like to withdraw the paper to address these concerns.
♻ ☆ ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model NeurIPS2025
In real-world scenarios, person re-identification (ReID) aims to identify a
person-of-interest via the descriptive query, regardless of whether the query
is a single modality or a combination of multiple modalities. However, existing
methods and datasets remain constrained to limited modalities, failing to meet
this requirement. Therefore, we investigate a new challenging problem called
Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve
effective retrieval with varying multi-modal queries. To address dataset
scarcity, we construct ORBench, the first high-quality multi-modal dataset
comprising 1,000 unique identities across five modalities: RGB, infrared, color
pencil, sketch, and textual description. The dataset also offers significant diversity, for example in painting perspectives and textual information, and could serve as an ideal platform for follow-up
investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal
learning framework for person ReID. It enables synergistic fusion and
cross-modal alignment of arbitrary modality combinations in a single model,
with a unified encoding and multi-expert routing mechanism proposed. Extensive
experiments verify the advancement and practicality of our ORBench. A wide
range of possible models have been evaluated and compared on it, and our
proposed ReID5o model gives the best performance. The dataset and code will be
made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.
comment: NeurIPS2025 Accepted Paper
♻ ☆ Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model
3D medical image analysis is essential for modern healthcare, yet traditional
task-specific models are inadequate due to limited generalizability across
diverse clinical scenarios. Multimodal large language models (MLLMs) offer a
promising solution to these challenges. However, existing MLLMs have
limitations in fully leveraging the rich, hierarchical information embedded in
3D medical images. Inspired by clinical practice, where radiologists focus on
both 3D spatial structure and 2D planar content, we propose Med-2E3, a 3D
medical MLLM that integrates a dual 3D-2D encoder architecture. To aggregate 2D
features effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring
module, which scores the attention of each 2D slice based on slice contents and
task instructions. To the best of our knowledge, Med-2E3 is the first MLLM to
integrate both 3D and 2D features for 3D medical image analysis. Experiments on
large-scale, open-source 3D medical multimodal datasets demonstrate that TG-IS
exhibits task-specific attention distribution and significantly outperforms
current state-of-the-art models. The code is available at:
https://github.com/MSIIP/Med-2E3
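The TG-IS idea (scoring each 2D slice against the task instruction and using the scores to aggregate 2D features) can be sketched compactly. The cosine-similarity scoring, softmax weighting, temperature, and shapes below are assumptions for illustration, not the module's actual design.

```python
# Minimal sketch of text-guided inter-slice scoring and aggregation.
import torch
import torch.nn.functional as F

def tg_is_aggregate(slice_feats: torch.Tensor, text_feat: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """slice_feats: (S, D) per-slice 2D features; text_feat: (D,) instruction embedding."""
    sims = F.cosine_similarity(slice_feats, text_feat.unsqueeze(0), dim=-1)   # (S,)
    weights = torch.softmax(sims / temperature, dim=0)                        # (S,)
    return (weights.unsqueeze(-1) * slice_feats).sum(dim=0)                   # (D,)
```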
♻ ☆ GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction NeurIPS 2025
Eya Cherif, Arthur Ouaknine, Luke A. Brown, Phuong D. Dao, Kyle R. Kovach, Bing Lu, Daniel Mederer, Hannes Feilhauer, Teja Kattenborn, David Rolnick
Plant traits such as leaf carbon content and leaf mass are essential
variables in the study of biodiversity and climate change. However,
conventional field sampling cannot feasibly cover trait variation at
ecologically meaningful spatial scales. Machine learning represents a valuable
solution for plant trait prediction across ecosystems, leveraging hyperspectral
data from remote sensing. Nevertheless, trait prediction from hyperspectral
data is challenged by label scarcity and substantial domain shifts (e.g., across sensors and ecological distributions), requiring robust cross-domain methods.
Here, we present GreenHyperSpectra, a pretraining dataset encompassing
real-world cross-sensor and cross-ecosystem samples designed to benchmark trait
prediction with semi- and self-supervised methods. We adopt an evaluation
framework encompassing in-distribution and out-of-distribution scenarios. We
successfully leverage GreenHyperSpectra to pretrain label-efficient
multi-output regression models that outperform the state-of-the-art supervised
baseline. Our empirical analyses demonstrate substantial improvements in
learning spectral representations for trait prediction, establishing a
comprehensive methodological framework to catalyze research at the intersection
of representation learning and plant functional traits assessment. All code and
data are available at: https://github.com/echerif18/HyspectraSSL.
comment: Accepted at the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025)
♻ ☆ Think With Videos For Agentic Long-Video Understanding
Long-video understanding (LVU) is a challenging problem in computer vision.
Existing methods either downsample frames for single-pass reasoning,
sacrificing fine-grained details, or depend on textual reasoning over
task-agnostic representations, hindering task-specific perception and
exploration. In this paper, we propose VideoExplorer, a framework grounded in
the principle of "thinking with video", which naturally intertwines planning,
temporal grounding, and scalable perception into a coherent reasoning process.
Rather than reasoning over a static context, VideoExplorer iteratively
formulates sub-questions, locates relevant moments, and performs task-oriented,
temporally scalable video understanding until reaching the final answer,
enabling faithful, efficient, and interpretable reasoning. To address the lack
of LVU training resources, we construct a long-video reasoning dataset using
difficulty-adaptive sampling to ensure high-quality trajectories on complex
tasks. Building on this dataset, we design a two-stage training pipeline:
supervised trajectory initialization followed by trajectory-level preference
optimization, encouraging adaptive temporal grounding and iterative information
integration guided by downstream rewards. Extensive evaluations on popular
long-video understanding and reasoning benchmarks demonstrate VideoExplorer's
significant advantage over existing baselines, highlighting its robustness,
adaptability, and efficiency. Our code is publicly available at https://github.com/yhy-2000/VideoDeepResearch.
♻ ☆ DA$^2$: Depth Anything in Any Direction
Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, Chunchao Guo
Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more
complete visual description than perspective images. Thanks to this
characteristic, panoramic depth estimation is gaining increasing traction in 3D
vision. However, due to the scarcity of panoramic data, previous methods are
often restricted to in-domain settings, leading to poor zero-shot
generalization. Furthermore, due to the spherical distortions inherent in
panoramas, many approaches rely on perspective splitting (e.g., cubemaps),
which leads to suboptimal efficiency. To address these challenges, we propose
$\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in
$\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and
fully end-to-end panoramic depth estimator. Specifically, for scaling up
panoramic data, we introduce a data curation engine for generating high-quality
panoramic depth data from perspective, and create $\sim$543K panoramic
RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the
spherical distortions, we present SphereViT, which explicitly leverages
spherical coordinates to enforce the spherical geometric consistency in
panoramic image features, yielding improved performance. A comprehensive
benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA
performance, with an average 38% improvement on AbsRel over the strongest
zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain
methods, highlighting its superior zero-shot generalization. Moreover, as an
end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based
approaches. Both the code and the curated panoramic data have been released.
Project page: https://depth-any-in-any-dir.github.io/.
comment: Work primarily done during an internship at Tencent Hunyuan. Project
page: https://depth-any-in-any-dir.github.io/
♻ ☆ H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting
Dynamic scene reconstruction poses a persistent challenge in 3D vision.
Deformable 3D Gaussian Splatting has emerged as an effective method for this
task, offering real-time rendering and high visual fidelity. This approach
decomposes a dynamic scene into a static representation in a canonical space
and time-varying scene motion. Scene motion is defined as the collective
movement of all Gaussian points, and for compactness, existing approaches
commonly adopt implicit neural fields or sparse control points. However, these
methods predominantly rely on gradient-based optimization for all motion
information. Due to the high degree of freedom, they struggle to converge on
real-world datasets exhibiting complex motion. To preserve the compactness of
motion representation and address convergence challenges, this paper proposes
heterogeneous 3D control points, termed H3D control points, whose
attributes are obtained using a hybrid strategy combining optical flow
back-projection and gradient-based methods. This design decouples directly
observable motion components from those that are geometrically occluded.
Specifically, components of 3D motion that project onto the image plane are
directly acquired via optical flow back projection, while unobservable portions
are refined through gradient-based optimization. Experiments on the Neu3DV and
CMU-Panoptic datasets demonstrate that our method achieves superior performance
over state-of-the-art deformable 3D Gaussian splatting techniques. Remarkably,
our method converges within just 100 iterations and achieves a per-frame
processing speed of 2 seconds on a single NVIDIA RTX 4070 GPU.
♻ ☆ WMamba: Wavelet-based Mamba for Face Forgery Detection ACM MM 2025
The rapid evolution of deepfake generation technologies necessitates the
development of robust face forgery detection algorithms. Recent studies have
demonstrated that wavelet analysis can enhance the generalization abilities of
forgery detectors. Wavelets effectively capture key facial contours (often slender, fine-grained, and globally distributed) that may conceal subtle forgery artifacts imperceptible in the spatial domain. However, current
wavelet-based approaches fail to fully exploit the distinctive properties of
wavelet data, resulting in sub-optimal feature extraction and limited
performance gains. To address this challenge, we introduce WMamba, a novel
wavelet-based feature extractor built upon the Mamba architecture. WMamba
maximizes the utility of wavelet information through two key innovations.
First, we propose Dynamic Contour Convolution (DCConv), which employs specially
crafted deformable kernels to adaptively model slender facial contours. Second,
by leveraging the Mamba architecture, our method captures long-range spatial
relationships with linear complexity. This efficiency allows for the extraction
of fine-grained, globally distributed forgery artifacts from small image
patches. Extensive experiments show that WMamba achieves state-of-the-art
(SOTA) performance, highlighting its effectiveness in face forgery detection.
comment: Accepted by ACM MM 2025
♻ ☆ Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable NeurIPS 2025
Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, Shouhong Ding
Existing detectors are often trained on biased datasets, leading to the
possibility of overfitting on non-causal image attributes that are spuriously
correlated with real/synthetic labels. While these biased features enhance
performance on the training data, they result in substantial performance
degradation when applied to unbiased datasets. One common solution is to
perform dataset alignment through generative reconstruction, matching the
semantic content between real and synthetic images. However, we revisit this
approach and show that pixel-level alignment alone is insufficient. The
reconstructed images still suffer from frequency-level misalignment, which can
perpetuate spurious correlations. To illustrate, we observe that reconstruction
models tend to restore the high-frequency details lost in real images (possibly
due to JPEG compression), inadvertently creating a frequency-level
misalignment, where synthetic images appear to have richer high-frequency
content than real ones. This misalignment leads to models associating
high-frequency features with synthetic labels, further reinforcing biased cues.
To resolve this, we propose Dual Data Alignment (DDA), which aligns both the
pixel and frequency domains. Moreover, we introduce two new test sets:
DDA-COCO, containing DDA-aligned synthetic images for testing detector
performance on the most aligned dataset, and EvalGEN, featuring the latest
generative models for assessing detectors under new generative architectures
such as visual auto-regressive generators. Finally, our extensive evaluations
demonstrate that a detector trained exclusively on DDA-aligned MSCOCO could
improve across 8 diverse benchmarks by a non-trivial margin, showing a +7.2% on
in-the-wild benchmarks, highlighting the improved generalizability of unbiased
detectors. Our code is available at:
https://github.com/roy-ch/Dual-Data-Alignment.
comment: NeurIPS 2025 Spotlight. 13 Pages, 10 figures
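The frequency-level misalignment described above suggests a simple corrective step: make the reconstructed (synthetic) image inherit the high-frequency band of its paired real image so the two no longer differ in high-frequency richness. The FFT-based sketch below is one plausible way to do this; the cutoff and the exact alignment used by DDA are assumptions.

```python
# Hedged sketch of frequency-level alignment between a real image and its
# generative reconstruction.
import torch

def align_high_frequencies(real: torch.Tensor, synthetic: torch.Tensor,
                           cutoff: float = 0.25) -> torch.Tensor:
    """real, synthetic: (C, H, W) images in [0, 1]; returns an aligned synthetic image."""
    C, H, W = real.shape
    fr = torch.fft.fftshift(torch.fft.fft2(real), dim=(-2, -1))
    fs = torch.fft.fftshift(torch.fft.fft2(synthetic), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    low_pass = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(real.dtype)   # (H, W) radial mask
    merged = fs * low_pass + fr * (1.0 - low_pass)   # low freqs from synthetic, high from real
    aligned = torch.fft.ifft2(torch.fft.ifftshift(merged, dim=(-2, -1))).real
    return aligned.clamp(0.0, 1.0)
```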
♻ ☆ gen2seg: Generative Models Enable Generalizable Instance Segmentation
By pretraining to synthesize coherent images from perturbed inputs,
generative models inherently learn to understand object boundaries and scene
compositions. How can we repurpose these generative representations for
general-purpose perceptual organization? We finetune Stable Diffusion and MAE
(encoder+decoder) for category-agnostic instance segmentation using our
instance coloring loss exclusively on a narrow set of object types (indoor
furnishings and cars). Surprisingly, our models exhibit strong zero-shot
generalization, accurately segmenting objects of types and styles unseen in
finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our
best-performing models closely approach the heavily supervised SAM when
evaluated on unseen object types and styles, and outperform it when segmenting
fine structures and ambiguous boundaries. In contrast, existing promptable
segmentation architectures or discriminatively pretrained models fail to
generalize. This suggests that generative models learn an inherent grouping
mechanism that transfers across categories and domains, even without
internet-scale pretraining. Code, pretrained models, and demos are available on
our website.
comment: Website: https://reachomk.github.io/gen2seg/
♻ ☆ Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning
Cryoablation is a minimally invasive localised treatment for prostate cancer
that destroys malignant tissue during de-freezing, while sparing surrounding
healthy structures. Its success depends on accurate preoperative planning of
cryoprobe placements to fully cover the tumour and avoid critical anatomy. This
planning is currently manual, expertise-dependent, and time-consuming, leading
to variability in treatment quality and limited scalability. In this work, we
introduce Cryo-RL, a reinforcement learning framework that models cryoablation
planning as a Markov decision process and learns an optimal policy for
cryoprobe placement. Within a simulated environment that models clinical
constraints and stochastic intraoperative variability, an agent sequentially
selects cryoprobe positions and ice sphere diameters. Guided by a reward
function based on tumour coverage, this agent learns a cryoablation strategy
that leads to optimal cryoprobe placements without the need for any
manually-designed plans. Evaluated on 583 retrospective prostate cancer cases,
Cryo-RL achieved over 8 percentage-point Dice improvements compared with the
best automated baselines, based on geometric optimisation, and matched human
expert performance while requiring substantially less planning time. These
results highlight the potential of reinforcement learning to deliver clinically
viable, reproducible, and efficient cryoablation plans.
comment: Accepted at MICAD (Medical Imaging and Computer-Aided Diagnosis) 2025
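The planning MDP described above can be sketched as a small environment: the state is the set of placed probes, an action places one probe (position plus ice-sphere radius), and the reward is the incremental gain in tumour coverage. Grid resolution, the spherical ablation model, and the probe budget below are simplifying assumptions, not the paper's simulator.

```python
# Minimal sketch of a cryoablation-planning environment with a coverage reward.
import numpy as np

class CryoPlanEnv:
    def __init__(self, tumour_mask: np.ndarray, max_probes: int = 6):
        self.tumour = tumour_mask.astype(bool)   # (D, H, W) voxel mask of the tumour
        self.max_probes = max_probes
        self.reset()

    def reset(self):
        self.ablated = np.zeros_like(self.tumour, dtype=bool)
        self.n_probes = 0
        return self.ablated.copy()

    def step(self, centre, radius_vox):
        zz, yy, xx = np.indices(self.tumour.shape)
        sphere = ((zz - centre[0]) ** 2 + (yy - centre[1]) ** 2 +
                  (xx - centre[2]) ** 2) <= radius_vox ** 2
        before = (self.ablated & self.tumour).sum() / max(self.tumour.sum(), 1)
        self.ablated |= sphere
        after = (self.ablated & self.tumour).sum() / max(self.tumour.sum(), 1)
        self.n_probes += 1
        done = self.n_probes >= self.max_probes
        return self.ablated.copy(), after - before, done   # reward = coverage gain
```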
♻ ☆ From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes NeurIPS 2025
3D visual grounding has made notable progress in localizing objects within
complex 3D scenes. However, grounding referring expressions beyond objects in
3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a
holistic 3D visual grounding benchmark consisting of 2,886 referring
expression-3D bounding box pairs spanning four different grounding levels:
human-activity areas, unoccupied space beyond objects, individual objects in
the scene, and fine-grained object parts. We assess a range of state-of-the-art
3D visual grounding methods alongside large language models (LLMs) and
multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that
space-level and part-level visual grounding pose the greatest challenges:
space-level tasks require a more comprehensive spatial reasoning ability, for
example, modeling distances and spatial relations within 3D space, while
part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.00% accuracy on
space-level tasks and 31.46% on part-level tasks, significantly lower than its
performance on area-level and object-level tasks. These findings underscore a
critical gap in current models' capacity to understand and reason about 3D
scenes beyond object-level semantics.
comment: NeurIPS 2025 (Datasets and Benchmarks). Project page:
https://anywhere-3d.github.io/
♻ ★ Polyline Path Masked Attention for Vision Transformer
Global dependency modeling and spatial position modeling are two core issues
of the foundational architecture design in current deep learning frameworks.
Recently, Vision Transformers (ViTs) have achieved remarkable success in
computer vision, leveraging the powerful global dependency modeling capability
of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its
significant potential in natural language processing tasks by explicitly
modeling the spatial adjacency prior through the structured mask. In this
paper, we propose Polyline Path Masked Attention (PPMA) that integrates the
self-attention mechanism of ViTs with an enhanced structured mask of Mamba2,
harnessing the complementary strengths of both architectures. Specifically, we
first ameliorate the traditional structured mask of Mamba2 by introducing a 2D
polyline path scanning strategy and derive its corresponding structured mask,
polyline path mask, which better preserves the adjacency relationships among
image tokens. Notably, we conduct a thorough theoretical analysis on the
structural characteristics of the proposed polyline path mask and design an
efficient algorithm for the computation of the polyline path mask. Next, we
embed the polyline path mask into the self-attention mechanism of ViTs,
enabling explicit modeling of spatial adjacency prior. Extensive experiments on
standard benchmarks, including image classification, object detection, and
segmentation, demonstrate that our model outperforms previous state-of-the-art
approaches based on both state-space models and Transformers. For example, our
proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K
semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%,
respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
♻ ☆ Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge NeurIPS 2025
Facial personalization faces the challenge of maintaining identity fidelity without disrupting the foundation model's prompt consistency. Mainstream
personalization models employ identity embedding to integrate identity
information within the attention mechanisms. However, our preliminary findings
reveal that identity embeddings compromise the effectiveness of other tokens in
the prompt, thereby limiting high prompt consistency and attribute-level
controllability. Moreover, by deactivating identity embedding, personalization
models still demonstrate the underlying foundation models' ability to control
facial attributes precisely. This suggests that the foundation models' knowledge
can be leveraged to cure the ill-aligned prompt consistency of personalization
models. Building upon these insights, we propose FreeCure, a framework that
improves the prompt consistency of personalization models with their latent
foundation models' knowledge. First, by setting a dual inference paradigm
with/without identity embedding, we identify attributes (e.g., hair and accessories) for enhancement. Second, we introduce a novel
foundation-aware self-attention module, coupled with an inversion-based process
to bring well-aligned attribute information to the personalization process. Our approach is training-free, can effectively enhance a wide array of facial attributes, and can be seamlessly integrated into existing popular personalization models based on both Stable Diffusion and FLUX. FreeCure has
consistently shown significant improvements in prompt consistency across these
facial personalization models while maintaining the integrity of their original
identity fidelity.
comment: Accepted to NeurIPS 2025
♻ ☆ Global Prompt Refinement with Non-Interfering Attention Masking for One-Shot Federated Learning NeurIPS'25
Federated Prompt Learning (FPL) enables communication-efficient adaptation by
tuning lightweight prompts on top of frozen pre-trained models. Existing FPL
methods typically rely on global information, which is only available after the
second training round, to facilitate collaboration among client models.
Therefore, they are inherently dependent on multi-round communication to fully
exhibit their strengths. Moreover, existing one-shot federated learning methods
typically focus on fitting seen tasks, but lack cross-task generalization. To
bridge this gap, we propose the Global Prompt Refinement with Non-Interfering
Attention Masking (GPR-NIAM) method for one-shot FPL. The core idea is to
design a masking mechanism that restricts excessive interaction between the
original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves
this through the collaboration of two key modules. Firstly, the attention
isolation module suppresses attention from the learnable prompt tokens to the
original text tokens, and reweights the reverse attention, which preserves
generalization across tasks. Secondly, the cross-silo collaborative refinement
module integrates decentralized visual knowledge into a unified base and
calibrates the global prompt through multi-source cross-modal knowledge
alignment, further mitigating the inconsistency caused by data heterogeneity.
Extensive experiments conducted on ten benchmark datasets under two tasks show
that GPR-NIAM outperforms eight state-of-the-art methods in both class-level
and domain-level generalization.
comment: NeurIPS'25 accepted
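The attention-isolation part of GPR-NIAM reduces to a structured attention mask: prompt-token queries are blocked from attending to the original text tokens, while the reverse direction is kept (and, in the method, reweighted). The additive mask below is a hedged sketch under an assumed token layout; the reweighting rule and the cross-silo refinement module are not shown.

```python
# Hedged sketch of a non-interfering attention mask for prompt tuning.
import torch

def build_niam_mask(n_text: int, n_prompt: int) -> torch.Tensor:
    """Returns an additive mask of shape (n_text + n_prompt, n_text + n_prompt):
    0 where attention is allowed, -inf where it is suppressed.
    Token layout assumption: [text tokens | learnable prompt tokens]."""
    n = n_text + n_prompt
    mask = torch.zeros(n, n)
    # Rows are queries, columns are keys: prompt queries may not attend to text keys.
    mask[n_text:, :n_text] = float("-inf")
    return mask

# Usage with scaled dot-product attention (shapes assumed):
# attn_logits = q @ k.transpose(-2, -1) / d ** 0.5 + build_niam_mask(n_text, n_prompt)
```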
♻ ☆ Exploring Cross-Modal Flows for Few-Shot Learning
Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models
can achieve a general alignment between image and text, they often require
parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT
methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively
fine-tune a subset of parameters, which can slightly adjust either visual or
textual features, and avoid overfitting. In this paper, we are the first to
highlight that all existing PEFT methods perform one-step adjustment. This is insufficient for complex (or difficult) datasets, where features of different
modalities are highly entangled. To this end, we propose the first
model-agnostic multi-step adjustment approach by learning a cross-modal
velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the
correspondence between categories during training, we first utilize a fixed
coupling strategy. Then, we propose a noise augmentation strategy to alleviate
the data scarcity issue. Finally, we design an early-stopping solver, which
terminates the transformation process earlier, improving both efficiency and
accuracy. Compared with one-step PEFT methods, FMA has the multi-step
rectification ability to achieve more precise and robust alignment. Extensive
results have demonstrated that FMA can consistently yield significant
performance gains across various benchmarks and backbones, particularly on
challenging datasets.
comment: 13 pages, 6 figures
♻ ☆ CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
The sparse cross-modality detector offers more advantages than its
counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of
adaptability for downstream tasks and computational cost savings. However,
existing sparse detectors overlook the quality of token representation, leading to sub-optimal foreground quality and limited performance. In this paper, we identify that preserving geometric structure and balancing the class distribution are key to improving the performance of sparse detectors, and propose a Sparse Selector (SS). The core module of SS is Ray-Aware
Supervision (RAS), which preserves rich geometric information during the
training stage, and Class-Balanced Supervision, which adaptively reweights the
salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling, thereby outperforming other sparse multi-modal detectors in token representation. Additionally, we design
Ray Positional Encoding (Ray PE) to address the distribution differences
between the LiDAR modality and the image. Finally, we integrate the
aforementioned module into an end-to-end sparse multi-modality detector, dubbed
CrossRay3D. Experiments show that, on the challenging nuScenes benchmark,
CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS,
while running 1.84x faster than other leading methods. Moreover, CrossRay3D
demonstrates strong robustness even in scenarios where LiDAR or camera data are
partially or entirely missing.
comment: 13 pages
♻ ☆ Class-wise Balancing Data Replay for Federated Class-Incremental Learning NeurIPS'25
Federated Class Incremental Learning (FCIL) aims to collaboratively process
continuously increasing incoming tasks across multiple clients. Among various
approaches, data replay has become a promising solution, which can alleviate
forgetting by reintroducing representative samples from previous tasks.
However, their performance is typically limited by class imbalance, both within
the replay buffer due to limited global awareness and between replayed and
newly arrived classes. To address this issue, we propose a class-wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination
mechanism for class-level memory construction and reweights the learning
objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has
two key components: 1) the global-perspective data replay module reconstructs global representations of prior tasks in a privacy-preserving manner, which then
guides a class-aware and importance-sensitive sampling strategy to achieve
balanced replay; 2) to handle class imbalance across tasks, the task-aware temperature scaling module adaptively adjusts the temperature of
logits at both class and instance levels based on task dynamics, which reduces
the model's overconfidence in majority classes while enhancing its sensitivity
to minority classes. Experimental results verify that FedCBDR achieves
balanced class-wise sampling under heterogeneous data distributions and
improves generalization under task imbalance between earlier and recent tasks,
yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.
comment: NeurIPS'25 Accepted, Oral
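The task-aware temperature scaling described above can be illustrated as a per-class temperature derived from how often each class is represented: logits of majority (well-replayed) classes get softened, minority classes keep sharper logits. The linear frequency-to-temperature mapping below is an assumption for illustration, not the paper's exact rule.

```python
# Minimal sketch of class-wise temperature scaling against class imbalance.
import torch

def task_aware_scale(logits: torch.Tensor, class_counts: torch.Tensor,
                     t_min: float = 1.0, t_max: float = 2.0) -> torch.Tensor:
    """logits: (B, C); class_counts: (C,) samples seen per class in this round."""
    freq = class_counts.float() / class_counts.sum().clamp_min(1)
    # Map relative frequency to a per-class temperature in [t_min, t_max]:
    # frequent classes are softened more, rare classes barely at all.
    temp = t_min + (t_max - t_min) * (freq / freq.max().clamp_min(1e-8))
    return logits / temp.unsqueeze(0)
```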
♻ ☆ When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
Xianzheng Ma, Brandon Smart, Yash Bhalgat, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu
As large language models (LLMs) evolve, their integration with 3D spatial
data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for
understanding and interacting with physical spaces. This survey provides a
comprehensive overview of the methodologies enabling LLMs to process,
understand, and generate 3D data. Highlighting the unique advantages of LLMs,
such as in-context learning, step-by-step reasoning, open-vocabulary
capabilities, and extensive world knowledge, we underscore their potential to
significantly advance spatial comprehension and interaction within embodied
Artificial Intelligence (AI) systems. Our investigation spans various 3D data
representations, from point clouds to Neural Radiance Fields (NeRFs). It
examines their integration with LLMs for tasks such as 3D scene understanding,
captioning, question-answering, and dialogue, as well as LLM-based agents for
spatial reasoning, planning, and navigation. The paper also includes a brief
review of other methods that integrate 3D and language. The meta-analysis
presented in this paper reveals significant progress yet underscores the
necessity for novel approaches to harness the full potential of 3D-LLMs. Hence,
with this paper, we aim to chart a course for future research that explores and
expands the capabilities of 3D-LLMs in understanding and interacting with the
complex 3D world. To support this survey, we have established a project page
where papers related to our topic are organized and listed:
https://github.com/ActiveVisionLab/Awesome-LLM-3D.
comment: 2nd version update to Jun.2025
♻ ☆ Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Writing is a universal cultural technology that reuses vision for symbolic
communication. Humans display striking resilience: we readily recognize words
even when characters are fragmented, fused, or partially occluded. This paper
investigates whether advanced vision language models (VLMs) share this
resilience. We construct two psychophysics inspired benchmarks across distinct
writing systems, Chinese logographs and English alphabetic words, by splicing,
recombining, and overlaying glyphs to yield "visible but unreadable" stimuli
for models while remaining legible to humans. Despite strong performance on
clean text, contemporary VLMs show a severe drop under these perturbations,
frequently producing unrelated or incoherent outputs. The pattern suggests a
structural limitation: models heavily leverage generic visual invariances but
under-rely on the compositional priors needed for robust literacy. We release
stimuli generation code, prompts, and evaluation protocols to facilitate
transparent replication and follow up work. Our findings motivate architectures
and training strategies that encode symbol segmentation, composition, and
binding across scripts, and they delineate concrete challenges for deploying
multimodal systems in education, accessibility, cultural heritage, and
security.
comment: Agent4Science 2025 Spotlight
♻ ☆ Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
Learning general-purpose reasoning capabilities has long been a challenging
problem in AI. Recent research in large language models (LLMs), such as
DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can
enable pre-trained LLMs to develop reasoning capabilities using simple
question-answer pairs. In this paper, we aim to train visual language models
(VLMs) to perform reasoning on image data through reinforcement learning and
visual question-answer pairs, without any explicit chain-of-thought (CoT)
supervision. Our findings indicate that simply applying reinforcement learning
to a VLM -- by prompting the model to produce a reasoning chain before
providing an answer -- can lead the model to develop shortcuts from easy
questions, thereby reducing its ability to generalize across unseen data
distributions. We argue that the key to mitigating shortcut learning is to
encourage the model to interpret images prior to reasoning. Therefore, we train
the model to adhere to a caption-reason-answer output format: initially
generating a detailed caption for an image, followed by constructing an
extensive reasoning chain. When trained on 273K CoT-free visual question-answer
pairs and using only reinforcement learning, our model, named Visionary-R1,
outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and
Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
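The caption-reason-answer format described above lends itself to a simple rule-based reward for RL fine-tuning. Below is a hedged sketch of such a reward; the tag names, reward weights, and exact-match accuracy check are illustrative assumptions, not the released Visionary-R1 training code.

```python
# Hypothetical format-plus-accuracy reward for caption-reason-answer outputs.
import re

def format_and_accuracy_reward(completion: str, gold_answer: str) -> float:
    """Reward completions that follow <caption>...<reason>...<answer>... and answer correctly."""
    pattern = r"<caption>(.+?)</caption>\s*<reason>(.+?)</reason>\s*<answer>(.+?)</answer>"
    m = re.search(pattern, completion, flags=re.DOTALL)
    if m is None:
        return 0.0                      # no valid format -> no reward
    reward = 0.5                        # format bonus (assumed weight)
    predicted = m.group(3).strip().lower()
    if predicted == gold_answer.strip().lower():
        reward += 1.0                   # accuracy bonus (assumed weight)
    return reward
```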
♻ ☆ Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation
The Object Goal Navigation (ObjectNav) task challenges agents to locate a
specified object in an unseen environment by imagining unobserved regions of
the scene. Prior approaches rely on deterministic and discriminative models to
complete semantic maps, overlooking the inherent uncertainty in indoor layouts
and limiting their ability to generalize to unseen environments. In this work,
we propose GOAL, a generative flow-based framework that models the semantic
distribution of indoor environments by bridging observed regions with
LLM-enriched full-scene semantic maps. During training, spatial priors inferred
from large language models (LLMs) are encoded as two-dimensional Gaussian
fields and injected into target maps, distilling rich contextual knowledge into
the flow model and enabling more generalizable completions. Extensive
experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D
and Gibson, and shows strong generalization in transfer settings to HM3D. Codes
and pretrained models are available at https://github.com/Badi-Li/GOAL.
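A minimal sketch of the prior-injection step described above: an LLM-inferred object location is encoded as a 2D Gaussian field and blended into the corresponding channel of a semantic target map. Grid size, sigma, and the additive clipping are assumptions, not the GOAL implementation.

```python
# Hypothetical injection of an LLM-derived spatial prior as a 2D Gaussian field.
import numpy as np

def gaussian_field(h: int, w: int, center: tuple, sigma: float) -> np.ndarray:
    """Isotropic 2D Gaussian bump centred at `center` on an (h, w) grid."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def inject_prior(target_map: np.ndarray, category: int, center: tuple, sigma: float = 8.0) -> np.ndarray:
    """Blend a Gaussian prior for `category` into a (C, H, W) semantic target map."""
    out = target_map.copy()
    _, h, w = out.shape
    out[category] = np.clip(out[category] + gaussian_field(h, w, center, sigma), 0.0, 1.0)
    return out
```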
♻ ☆ MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal
reasoning tasks through enhanced chain-of-thought capabilities. However, this
advancement also introduces novel safety risks, as these models become
increasingly vulnerable to harmful multimodal prompts that can trigger
unethical or unsafe behaviors. Existing safety alignment approaches, primarily
designed for unimodal language models, fall short in addressing the complex and
nuanced threats posed by multimodal inputs. Moreover, current safety datasets
lack the fine-grained, policy-grounded reasoning required to robustly align
reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality
Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align
supports fine-grained, deliberative reasoning over standardized safety policies
across both vision and text modalities. Our data generation pipeline emphasizes
multimodal diversity, policy-grounded reasoning, and rigorous quality filtering
using strong multimodal judges. Extensive experiments demonstrate that
fine-tuning VLMs on MSR-Align substantially improves robustness against both
textual and vision-language jailbreak attacks, while preserving or enhancing
general reasoning performance. MSR-Align provides a scalable and effective
foundation for advancing the safety alignment of reasoning-capable VLMs. Our
dataset is made publicly available at
https://huggingface.co/datasets/Leigest/MSR-Align.
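The quality-filtering stage of such a data pipeline can be pictured as a judge-thresholding loop. The sketch below is only an assumption about how a strong multimodal judge might be used to retain high-quality, policy-grounded samples; the field names and the 0-to-1 score scale are illustrative, not the MSR-Align pipeline.

```python
# Hypothetical judge-based quality filter for safety-reasoning samples.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SafetySample:
    image_path: str
    prompt: str
    reasoning: str      # policy-grounded chain of thought
    response: str

def filter_by_judge(samples: List[SafetySample],
                    judge: Callable[[SafetySample], float],
                    threshold: float = 0.8) -> List[SafetySample]:
    """Keep samples whose judge score (assumed to lie in [0, 1]) clears the threshold."""
    return [s for s in samples if judge(s) >= threshold]
```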
♻ ☆ RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang
Existing end-to-end autonomous driving (AD) algorithms typically follow the
Imitation Learning (IL) paradigm, which faces challenges such as causal
confusion and an open-loop gap. In this work, we propose RAD, a 3DGS-based
closed-loop Reinforcement Learning (RL) framework for end-to-end Autonomous
Driving. By leveraging 3DGS techniques, we construct a photorealistic digital
replica of the real physical world, enabling the AD policy to extensively
explore the state space and learn to handle out-of-distribution scenarios
through large-scale trial and error. To enhance safety, we design specialized
rewards to guide the policy in effectively responding to safety-critical events
and understanding real-world causal relationships. To better align with human
driving behavior, we incorporate IL into RL training as a regularization term.
We introduce a closed-loop evaluation benchmark consisting of diverse,
previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves
stronger performance in most closed-loop metrics, particularly exhibiting a 3x
lower collision rate. Abundant closed-loop results are presented in the
supplementary material. Code is available at https://github.com/hustvl/RAD to
facilitate future research.
comment: Code: https://github.com/hustvl/RAD
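The abstract describes incorporating imitation learning into RL training as a regularization term. A minimal sketch of one way to combine the two objectives is given below; the policy-gradient form, the MSE imitation term, and the weight lambda_il are assumptions rather than the actual RAD objective.

```python
# Hypothetical combination of an RL policy-gradient loss with an IL regularizer.
import torch
import torch.nn.functional as F

def rl_with_il_regularizer(log_probs: torch.Tensor,        # log pi(a_t | s_t) for sampled actions
                           advantages: torch.Tensor,       # estimated advantages
                           policy_actions: torch.Tensor,   # policy's continuous action predictions
                           expert_actions: torch.Tensor,   # logged human-driving actions
                           lambda_il: float = 0.1) -> torch.Tensor:
    rl_loss = -(log_probs * advantages.detach()).mean()    # policy-gradient term
    il_loss = F.mse_loss(policy_actions, expert_actions)   # imitation term as regularizer
    return rl_loss + lambda_il * il_loss
```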
♻ ☆ SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models
Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, Hua Gang
World models allow agents to simulate the consequences of actions in imagined
environments for planning, control, and long-horizon decision-making. However,
existing autoregressive world models struggle with visually coherent
predictions due to disrupted spatial structure, inefficient decoding, and
inadequate motion modeling. In response, we propose Scale-wise Autoregression
with Motion PrOmpt (SAMPO), a hybrid framework that combines visual
autoregressive
modeling for intra-frame generation with causal modeling for next-frame
generation. Specifically, SAMPO integrates temporal causal decoding with
bidirectional spatial attention, which preserves spatial locality and supports
parallel decoding within each scale. This design significantly enhances both
temporal consistency and rollout efficiency. To further improve dynamic scene
understanding, we devise an asymmetric multi-scale tokenizer that preserves
spatial details in observed frames and extracts compact dynamic representations
for future frames, optimizing both memory usage and model performance.
Additionally, we introduce a trajectory-aware motion prompt module that injects
spatiotemporal cues about object and robot trajectories, focusing attention on
dynamic regions and improving temporal consistency and physical realism.
Extensive experiments show that SAMPO achieves competitive performance in
action-conditioned video prediction and model-based control, improving
generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's
zero-shot generalization and scaling behavior, demonstrating its ability to
generalize to unseen tasks and benefit from larger model sizes.
comment: 22 pages, 15 figures
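The hybrid decoding described above, causal across frames but bidirectional within a frame, can be expressed as an attention mask. The sketch below assumes a flat token layout of num_frames x tokens_per_frame and a boolean "may attend" convention; it is not SAMPO's actual implementation.

```python
# Hypothetical attention mask: causal over frames, bidirectional within a frame.
import torch

def hybrid_attention_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Return an (N, N) boolean mask where True means the query token may attend to the key token."""
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame
    # Query i may attend to key j iff j's frame is not in the future of i's frame.
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)
```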
♻ ☆ Learning Collaborative Knowledge with Multimodal Representation for Polyp Re-Identification
Colonoscopic Polyp Re-Identification aims to match the same polyp from a
large gallery with images from different views taken using different cameras,
which plays an important role in the prevention and treatment of colorectal
cancer in computer-aided diagnosis. However, traditional methods for object
ReID directly adopting CNN models trained on the ImageNet dataset usually
produce unsatisfactory retrieval performance on colonoscopic datasets due to
the large domain gap. Worse, these solutions typically learn unimodal
representations from visual samples alone, failing to explore complementary
information from other modalities. To address this
challenge, we propose a novel Deep Multimodal Collaborative Learning framework
named DMCL for polyp re-identification, which can effectively encourage
multimodal knowledge collaboration and reinforce generalization capability in
medical scenarios. Building on this, a dynamic multimodal feature fusion
strategy is introduced to leverage the optimized visual-text representations
for multimodal fusion via end-to-end training. Experiments on the standard
benchmarks show the benefits of the multimodal setting over state-of-the-art
unimodal ReID models, especially when combined with the collaborative
multimodal fusion strategy. The code is publicly available at
https://github.com/JeremyXSC/DMCL.
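One simple instantiation of a dynamic multimodal fusion strategy is a learned gate that predicts per-sample modality weights. The sketch below is an assumption about how visual and text embeddings could be mixed end-to-end; the dimensions and gating design are not taken from the DMCL code.

```python
# Hypothetical dynamic visual-text fusion via a learned softmax gate.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual, text: (batch, dim) embeddings from the two branches.
        weights = self.gate(torch.cat([visual, text], dim=-1))   # (batch, 2) modality weights
        return weights[:, :1] * visual + weights[:, 1:] * text
```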
♻ ☆ GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization
Image geolocalization aims to predict the geographic location of images
captured anywhere on Earth, but its global nature presents significant
challenges. Current evaluation methodologies suffer from two major limitations.
First, data leakage: advanced approaches often rely on large vision-language
models (LVLMs) to predict image locations, yet these models are frequently
pretrained on the test datasets, compromising the accuracy of evaluating a
model's actual geolocalization capability. Second, existing metrics primarily
rely on exact geographic coordinates to assess predictions, which not only
neglects the reasoning process but also raises privacy concerns when user-level
location data is required. To address these issues, we propose GeoArena, the
first open platform for evaluating LVLMs on worldwide image geolocalization
tasks, offering true in-the-wild and human-centered benchmarking. GeoArena
enables users to upload in-the-wild images for a more diverse evaluation
corpus, and it leverages pairwise human judgments to determine which model
output better aligns with human expectations. Our platform has been deployed
online for two months, during which we collected thousands of voting records.
Based on this data, we conduct a detailed analysis and establish a leaderboard
of different LVLMs on the image geolocalization task. GeoArena has been
open-sourced to support future research.
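Arena-style platforms typically convert pairwise human votes into a leaderboard with a rating system such as Elo. The sketch below shows that generic recipe; the K-factor, initial rating, and the choice of Elo itself are assumptions, since the abstract does not specify GeoArena's ranking method.

```python
# Hypothetical Elo-style leaderboard built from pairwise human votes.
from collections import defaultdict
from typing import Iterable, Tuple

def elo_leaderboard(votes: Iterable[Tuple[str, str]], k: float = 32.0) -> dict:
    """`votes` yields (winner_model, loser_model) pairs from human judgments."""
    ratings = defaultdict(lambda: 1000.0)
    for winner, loser in votes:
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```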
♻ ☆ A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking NeurIPS 2025
Zixiang Zhao, Haowen Bai, Bingxin Ke, Yukun Cui, Lilun Deng, Yulun Zhang, Kai Zhang, Konrad Schindler
The real world is dynamic, yet most image fusion methods process static
frames independently, ignoring temporal correlations in videos and leading to
flickering and temporal inconsistency. To address this, we propose Unified
Video Fusion (UniVF), a novel and unified framework for video fusion that
leverages multi-frame learning and optical flow-based feature warping for
informative, temporally coherent video fusion. To support its development, we
also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive
benchmark covering four video fusion tasks: multi-exposure, multi-focus,
infrared-visible, and medical fusion. VF-Bench provides high-quality,
well-aligned video pairs obtained through synthetic data generation and
rigorous curation from existing datasets, with a unified evaluation protocol
that jointly assesses the spatial quality and temporal consistency of video
fusion. Extensive experiments show that UniVF achieves state-of-the-art results
across all tasks on VF-Bench. Project page: https://vfbench.github.io.
comment: Accepted by NeurIPS 2025 (Spotlight)
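Optical-flow-based feature warping, the building block cited above for temporal coherence, can be sketched with a differentiable grid sample. The flow convention (per-pixel offsets from the target frame back to the source frame) and bilinear sampling are assumptions, not the UniVF code.

```python
# Hypothetical flow-based feature warping with torch.nn.functional.grid_sample.
import torch
import torch.nn.functional as F

def warp_features(src: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp src (B, C, H, W) features with flow (B, 2, H, W) given in pixels."""
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=src.device),
                            torch.arange(w, device=src.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]              # sample locations in pixels
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize sample locations to [-1, 1] as grid_sample expects.
    grid = torch.stack((2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(src, grid, mode="bilinear", align_corners=True)
```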
♻ ☆ View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection AAAI 25
View transformation robustness (VTR), i.e., a method's stability under inputs
with various view transformations, is critical for deep-learning-based
multi-view 3D object reconstruction models. However, existing research has
seldom focused on view transformation robustness in multi-view 3D
object reconstruction. One direct way to improve the models' VTR is to produce
data with more view transformations and add them to model training. Recent
progress on large vision models, particularly Stable Diffusion models, has
provided great potential for generating 3D models or synthesizing novel view
images with only a single image input. Directly deploying these models at
inference consumes heavy computation resources and their robustness to view
transformations is not guaranteed either. To fully utilize the power of Stable
Diffusion models without extra inference computation burdens, we propose to
generate novel views with Stable Diffusion models for better view
transformation robustness. Instead of synthesizing random views, we propose a
reconstruction error-guided view selection method, which considers the
reconstruction errors' spatial distribution of the 3D predictions and chooses
the views that could cover the reconstruction errors as much as possible. The
methods are trained and tested on sets with large view transformations to
validate the 3D reconstruction models' robustness to view transformations.
Extensive experiments demonstrate that the proposed method can outperform
state-of-the-art 3D reconstruction methods and other view transformation
robustness comparison methods. Code is available at:
https://github.com/zqyq/VTR.
comment: AAAI 25
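The reconstruction error-guided selection described above can be pictured as a greedy coverage problem: repeatedly pick the candidate view that covers the most remaining error. The per-region error scores and the binary visibility matrix in the sketch below are illustrative assumptions, not the paper's selection criterion.

```python
# Hypothetical greedy error-coverage view selection.
import numpy as np

def select_views(error_per_region: np.ndarray,     # (R,) reconstruction error per region
                 coverage: np.ndarray,             # (V, R) 1 if candidate view v sees region r
                 num_views: int) -> list:
    remaining = error_per_region.astype(float).copy()
    chosen = []
    for _ in range(num_views):
        gains = coverage @ remaining               # error each candidate view would cover
        v = int(np.argmax(gains))
        chosen.append(v)
        remaining[coverage[v] > 0] = 0.0           # mark those regions as covered
    return chosen
```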
♻ ☆ ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction
Bed-related falls remain a major source of injury in hospitals and long-term
care facilities, yet many commercial alarms trigger only after a patient has
already left the bed. We show that early bed-exit intent can be predicted using
only one low-cost load cell mounted under a bed leg. The resulting load signals
are first converted into a compact set of complementary images: an RGB line
plot that preserves raw waveforms, and three texture maps (recurrence plot,
Markov transition field, and Gramian angular field) that expose higher-order
dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that
processes the line plot and texture maps in parallel and fuses them through
cross-attention to learn data-driven modality weights. To provide a realistic
benchmark, we collected six months of continuous data from 95 beds in a
long-term-care facility. On this real-world dataset ViFusionTST reaches an
accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D
time-series baselines across F1, recall, accuracy, and AUPRC. The results
demonstrate that image-based fusion of load-sensor signals for time series
classification is a practical and effective solution for real-time,
privacy-preserving fall prevention.
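Of the texture maps listed above, the Gramian angular field has a particularly compact definition, sketched below for a 1D load signal. The min-max rescaling and the summation variant (GASF) are assumptions; the paper may rely on a library implementation with different conventions.

```python
# Hypothetical Gramian angular summation field for a 1D load signal.
import numpy as np

def gramian_angular_field(signal: np.ndarray) -> np.ndarray:
    """Map a 1D signal of length T to a (T, T) Gramian angular summation field."""
    x = np.asarray(signal, dtype=float)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)   # rescale to [0, 1]
    x = 2.0 * x - 1.0                                 # then to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))            # angular encoding
    return np.cos(phi[:, None] + phi[None, :])        # GASF: cos(phi_i + phi_j)
```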
♻ ☆ Implicit Neural Compression of Point Clouds
Point clouds have gained prominence across numerous applications due to their
ability to accurately represent 3D objects and scenes. However, efficiently
compressing unstructured, high-precision point cloud data remains a significant
challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression
framework that leverages implicit neural representations (INRs) to encode both
geometry and attributes of dense point clouds. Our approach employs two
coordinate-based neural networks: one maps spatial coordinates to voxel
occupancy, while the other maps occupied voxels to their attributes, thereby
implicitly representing the geometry and attributes of a voxelized point cloud.
The encoder quantizes and compresses network parameters alongside auxiliary
information required for reconstruction, while the decoder reconstructs the
original point cloud by inputting voxel coordinates into the neural networks.
Furthermore, we extend our method to dynamic point cloud compression through
techniques that reduce temporal redundancy, including a 4D spatio-temporal
representation termed 4D-NeRC$^3$. Experimental results validate the
effectiveness of our approach: For static point clouds, NeRC$^3$ outperforms
the octree-based G-PCC standard and existing INR-based methods. For dynamic point
clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared
to the latest G-PCC and V-PCC standards, while matching state-of-the-art
learning-based methods. It also demonstrates competitive performance in joint
geometry and attribute compression.
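A minimal sketch of the first coordinate-based network described above, an MLP that maps 3D voxel coordinates to occupancy, is given below. The width, depth, activation, and absence of positional encoding are assumptions, not the NeRC$^3$ architecture.

```python
# Hypothetical coordinate-based occupancy network for implicit point cloud geometry.
import torch
import torch.nn as nn

class OccupancyINR(nn.Module):
    def __init__(self, hidden: int = 128, layers: int = 4):
        super().__init__()
        blocks, dim = [], 3
        for _ in range(layers):
            blocks += [nn.Linear(dim, hidden), nn.ReLU(inplace=True)]
            dim = hidden
        blocks.append(nn.Linear(dim, 1))
        self.net = nn.Sequential(*blocks)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) voxel coordinates scaled to [-1, 1]; returns occupancy logits.
        return self.net(coords)
```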
♻ ☆ LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
Referential grounding in outdoor driving scenes is challenging due to large
scene variability, many visually similar objects, and dynamic elements that
complicate resolving natural-language references (e.g., "the black car on the
right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf
vision-language models for fine-grained attribute extraction with large
language models for symbolic reasoning. LLM-RG processes an image and a
free-form referring expression by using an LLM to extract relevant object types
and attributes, detecting candidate regions, generating rich visual descriptors
with a VLM, and then combining these descriptors with spatial metadata into
natural-language prompts that are input to an LLM for chain-of-thought
reasoning to identify the referent's bounding box. Evaluated on the Talk2Car
benchmark, LLM-RG yields substantial gains over both LLM and VLM-based
baselines. Additionally, our ablations show that adding 3D spatial cues further
improves grounding. Our results demonstrate the complementary strengths of VLMs
and LLMs, applied in a zero-shot manner, for robust outdoor referential
grounding.
comment: Human-aware Embodied AI Workshop @ IROS 2025
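The prompt-assembly step of such a hybrid pipeline might look like the sketch below, which packs per-candidate VLM descriptors and spatial metadata into a chain-of-thought prompt for the LLM. The field names and wording are illustrative assumptions, not the LLM-RG prompts.

```python
# Hypothetical prompt assembly for LLM-based referent selection.
from typing import List, Dict

def build_grounding_prompt(referring_expression: str, candidates: List[Dict]) -> str:
    lines = [f"Referring expression: \"{referring_expression}\"",
             "Candidate objects:"]
    for i, c in enumerate(candidates):
        lines.append(f"  [{i}] type={c['type']}, attributes={c['description']}, "
                     f"bbox={c['bbox']}, depth={c.get('depth', 'unknown')} m")
    lines.append("Think step by step about which candidate matches the expression, "
                 "then answer with the candidate index.")
    return "\n".join(lines)
```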
♻ ☆ NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?
The evaluation of Vision-Language-Action (VLA) agents is hindered by coarse
end-task success metrics that fail to provide precise skill diagnosis or
measure robustness to real-world perturbations. This challenge is
exacerbated by a fragmented data landscape that impedes reproducible research
and the development of generalist models. To address these limitations, we
introduce NEBULA, a unified ecosystem for single-arm manipulation that enables
diagnostic and reproducible evaluation. NEBULA features a novel dual-axis
evaluation protocol that combines fine-grained capability tests for precise
skill diagnosis with systematic stress tests that measure robustness. A
standardized API and a large-scale, aggregated dataset are provided to reduce
fragmentation and support cross-dataset training and fair comparison. Using
NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities
such as spatial reasoning and dynamic adaptation, which are consistently
obscured by conventional end-task success metrics. By measuring both what an
agent can do and when it does so reliably, NEBULA provides a practical
foundation for robust, general-purpose embodied agents.
comment: Homepage: https://vulab-ai.github.io/NEBULA-Alpha/
♻ ☆ SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Existing research on 3D Large Language Models (LLMs) still struggles to
achieve grounded question-answering, primarily due to the under-exploration of
the mechanism of human-like scene-object grounded reasoning. This paper bridges
the gap by presenting a novel framework. We first introduce a grounded
Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex
reasoning task into simpler and manageable problems, and building corresponding
visual clues based on multimodal expert modules. To enable such a method, we
develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset,
consisting of 185K high-quality instances. Extensive experiments across various
complex 3D scene reasoning benchmarks demonstrate that our new framework
achieves strong performance with high grounding-QA coherence. To the best of
our knowledge, this is the first successful application of CoT reasoning to 3D
scene understanding, enabling step-by-step human-like reasoning and showing
potential for extension to broader 3D scene understanding scenarios.
comment: Project page: https://scenecot.github.io/